I. Executive Summary

In  this  project  we  focused  on  achieving  the  highest  average  efficiency  for  mm-wave  transmitters  ever 
reported  in  silicon  ICs.  The  reason  for  the  focus  on  average  efficiency,  as  opposed  to  solely  on  peak 
efficiency,  is  that  it  is  average  efficiency  that  captures  most  completely  the  power  requirements  of  an  RF 
transmitter.  In  addition,  our  goal  of  achieving  2  GHz  signal  bandwidth  placed  extreme  demands  on 
lowering the power consumption of the digital baseband part of the system, which is responsible for
high-speed control of the asymmetric multilevel outphasing (AMO) architecture and for nonlinear
predistortion of the transmitter. Finally, the foundation of the entire system concept is the
high-efficiency, high-power PAs that work at mm-wave frequencies.

In the course of the project, we have achieved several significant breakthroughs, as documented in
the summaries below and the attached publications.

Research  Highlights 

A 0.7W Fully Integrated 42GHz Power Amplifier with 10% PAE in 0.13µm SiGe BiCMOS,

IEEE  ISSCC  2013 


This paper presented a record mm-wave output power of
0.7W at 42 GHz. The architecture consists of a 16-way,
zero-degree-combined (ZDC) PA. Each individual PA was a
three-stage cascade design in 0.13µm SiGe. At the time of
publication, its output power was approximately 3 times higher
than that of any previously reported PA above 30 GHz. Soon
after publication, another ELASTx team published a PA at 0.5
W in CMOS. To date, this PA still has the highest output power
of any mm-wave PA on silicon.

This  work  is  significant  to  the  ELASTx  effort  in  that  it 
produced  the  highest  single  IC  power  of  any  performer  and 
pushed the SOA out by a factor of 3.


A  W-band  21.1  dBm  Power  Amplifier  with  an  8-way  Zero-degree  Combiner  in  45  nm  SOI  CMOS 
IEEE  MTT-S,  2014 


This  paper  presented  a  record  mm-wave  PA  in  W-band, 
achieving  115  mW  of  peak  output  power  at  80  GHz.  The 
PA uses an 8-way power-combined architecture with a ZDC. The
PA was designed for outphasing and can be used as a W-band
modulator, producing output from 0 to 115 mW.
This  work  was  significant  for  the  ELASTx  effort  as  it 
was  not  only  a  record  output  power,  but  also  was  the  first 
outphasing  PA  at  W-band. 
A Compact, High-gain Q-Band Stacked Power Amplifier in 45nm SOI CMOS With 19.2dBm Psat
and  19%  PAE 

IEEE  PAWR  Conference,  Jan  2015 


This paper presented a compact 4-stacked PA architecture. The
stacking not only increased the output power but also allowed
matching directly to 50 Ohms. This eliminated the output
matching network, resulting in a significant decrease in die area.

The active die area was only 0.09 mm2, compared to the SOA of
0.16 mm2 and 0.30 mm2. The PA achieved 19.2 dBm Psat, which compares
favorably to the SOA of 20.3 and 21.6 dBm, respectively. The
PAE was also comparable, at 19% compared to 19.4% and
25.1%. The latter figure is higher because that work did not include
the preamplifier in the design and thus had a lower DC power
consumption.

This work is significant to the ELASTx effort because a small die area
enables 1) a large number of combined PAs in a given area and 2) a reduced PA pitch, which is a fundamental
limit on combined efficiency when using the ZDC.


High-Throughput  Signal  Component  Separator  for  Asymmetric  Multi-Level  Outphasing  Power 
Amplifiers 

IEEE  Journal  Solid-State  Circuits,  February 


This paper presented the foundations of the
energy-efficient, high-throughput baseband
for the transmitters in this project. In addition to
energy efficiency and high throughput, the
signal component separator (SCS) achieves high
precision by using a fixed-point piecewise-linear
function approximation, developed to improve the
hardware efficiency of the outphasing signal
processing functions.
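
As a rough illustration of what a fixed-point piecewise-linear function approximation looks like, the sketch below approximates arctan on [0, 1) with a small table of quantized slopes and intercepts, using only shifts, masks, one multiply, and one add at evaluation time. The choice of function, the 12-bit fractional word length, and the 16 uniform segments are assumptions made for illustration; they are not taken from the SCS paper.

```python
import math

FRAC_BITS = 12            # assumed fractional word length
SEG_BITS = 4              # assumed: 2**4 = 16 uniform segments on [0, 1)
SCALE = 1 << FRAC_BITS


def build_table(func):
    """Precompute one quantized (slope, intercept) pair per segment."""
    n = 1 << SEG_BITS
    table = []
    for k in range(n):
        x0, x1 = k / n, (k + 1) / n
        slope = (func(x1) - func(x0)) / (x1 - x0)   # chord slope of the segment
        table.append((round(slope * SCALE), round(func(x0) * SCALE)))
    return table


def pwl_eval(table, x_fix):
    """Integer-only evaluation; x_fix is x * 2**FRAC_BITS with 0 <= x < 1."""
    shift = FRAC_BITS - SEG_BITS
    seg = x_fix >> shift                  # top bits select the segment
    dx = x_fix & ((1 << shift) - 1)       # low bits are the offset inside it
    slope, base = table[seg]
    return base + ((slope * dx) >> FRAC_BITS)


ATAN_TABLE = build_table(math.atan)


def max_error():
    """Worst-case absolute error over every representable input."""
    return max(abs(pwl_eval(ATAN_TABLE, x) / SCALE - math.atan(x / SCALE))
               for x in range(SCALE))
```

With these assumed parameters the worst-case error stays below 1e-3; more segments or a longer word length trade table size for accuracy.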

The chip was fabricated in a 45 nm SOI CMOS
process, and the SCS occupies an active area
of 1.5 mm2. The new algorithm enables the
SCS to run at 3.4 GSamples/s, producing the
phases with 12-bit accuracy. Compared to traditional low-throughput AMO SCS implementations, at 0.8
GSamples/s this design improves the area efficiency by 25x and the energy efficiency by 2x.

This  design  represents  the  fastest  high-precision  SCS  to  date  and  enables  a  new  class  of  high-throughput 
mm-wave and base station transmitters that can operate at high area, energy and spectral efficiency.
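
To make the role of a signal component separator concrete, the following sketch splits a complex baseband sample into two constant-envelope components whose amplitudes are drawn from a small set of discrete levels, using the law of cosines to find the two phases. The level values, the rule for choosing the level pair (smallest feasible sum), and all function names are assumptions for illustration; the fabricated SCS uses its own fixed-point algorithm.

```python
import cmath
import math
from itertools import combinations_with_replacement

LEVELS = (0.25, 0.5, 0.75, 1.0)   # hypothetical normalized amplitude levels


def _acos_clamped(x):
    """acos with its argument clamped to [-1, 1] to absorb rounding error."""
    return math.acos(max(-1.0, min(1.0, x)))


def amo_separate(s, levels=LEVELS):
    """Return (a1, phi1, a2, phi2) with a1*e^{j phi1} + a2*e^{j phi2} == s."""
    mag, theta = abs(s), cmath.phase(s)
    if mag == 0.0:
        a = min(levels)                       # equal amplitudes cancel exactly
        return a, math.pi / 2, a, -math.pi / 2
    # A pair (a1, a2) can reach |s| only if |a1 - a2| <= |s| <= a1 + a2.
    pairs = [(hi, lo)
             for lo, hi in combinations_with_replacement(sorted(levels), 2)
             if hi - lo <= mag <= hi + lo]
    if not pairs:
        raise ValueError("sample magnitude exceeds what the levels can reach")
    a1, a2 = min(pairs, key=sum)              # assumed selection rule
    alpha = _acos_clamped((mag**2 + a1**2 - a2**2) / (2 * mag * a1))
    beta = _acos_clamped((mag**2 + a2**2 - a1**2) / (2 * mag * a2))
    return a1, theta + alpha, a2, theta - beta


def recombine(a1, phi1, a2, phi2):
    """Vector sum of the two outphased components."""
    return a1 * cmath.exp(1j * phi1) + a2 * cmath.exp(1j * phi2)
```

Calling `recombine(*amo_separate(s))` reproduces `s` to within floating-point error for any sample the levels can reach.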

Technology Transition

The technologies developed in this project were successfully transitioned to two semiconductor startups:
Eta Devices and NanoSemi Inc.
Figure 4: SCS chip block diagram


Eta  Devices  designs  efficient  CMOS  power-amplifiers  for  mobile  basestation  infrastructure  and 
handsets. 

NanoSemi  Inc.  designs  linearization  solutions  for  analog  front-ends  such  as  basestation  and  handset 
transmitter  and  receiver  chains,  instrumentation  and  sensing  front-ends. 
II. Detailed Technical Information

A.  Highlights  Technical  Details 

The  technical  details  for  the  highlights  presented  in  the  previous  section  are  shown  in  the  papers  below. 
8.4 A 0.7W Fully Integrated 42GHz Power Amplifier with
10% PAE in 0.13µm SiGe BiCMOS

Wei Tai1, L. Richard Carley1, David S. Ricketts2
In this paper, we report a fully integrated power amplifier (PA) architecture that
combines the power of 16 on-chip PAs using a 16-way zero-degree combiner to
achieve an output power of 0.7W with a power-added efficiency (PAE) of 10% at
42GHz and a -3dB bandwidth of 9GHz. This is 2.6 times more output power than
a recently reported millimeter-Wave (mm-Wave) silicon-based PA [1]. The circuit
is a fully integrated mm-Wave PA achieving a leading output power
approaching 1 Watt in a silicon process.

To date only a few published mm-Wave PAs in silicon have achieved a saturated
output power of more than 20dBm (100mW) [2-6]. The difficulty in achieving
high output powers at mm-Wave frequencies lies in the limited output power
of a single PA, which is constrained by maximum current density, breakdown
voltage and parasitics of the technology. Moreover, to obtain maximum power
from a single PA, its output impedance must be lowered due to the relatively low
breakdown voltage of silicon devices. In fully integrated designs, this low impedance
must then be transformed to the desired output impedance
through a lossy on-chip impedance transformer, reducing the overall net output
power.

Power combining multiple PAs is a common approach to obtaining larger output
powers. As more unit PAs are combined, the required impedance transformation
ratio as well as voltage and thermal stress on each unit PA is relaxed.
However, the insertion loss of traditional power combiners, e.g. Wilkinson combiners
[3] and transformer-based combiners [4,5], tends to scale up as the number
of combined PAs increases due to increased size and complexity of the combiners.
Part of the additional size and complexity stems from maintaining port-to-port
isolation in the combiner, particularly with Wilkinson combiners. If we
assume, however, that all combiner inputs have zero-degree phase difference
(i.e. in-phase), port isolation is no longer a major constraint. In this case, the
requirement for exact quarter-wavelength segments in the Wilkinson combiner
is removed and arbitrary line lengths can be used subject only to layout constraints
and impedance transformation requirements. It is then possible to significantly
reduce the combiner size and insertion loss while simultaneously
achieving the desired impedance transformation. In this work, we combine the
power of 16 PAs using a 16-way zero-degree combiner that achieves low insertion
loss and wideband impedance transformation.

The combiner is developed using scalable SPICE transmission line models
derived from EM field simulations. The simulated insertion loss of the combiner
is less than 0.5dB at 45GHz. It also performs the required impedance transformation
and presents the optimum load impedance to each unit PA, whose value
was obtained from load-pull simulations of the unit PA output stage. Since a
dedicated impedance transformation network is no longer needed, the associated
power loss is avoided. We improve the isolation between input ports to better
than 10dB by inserting a small resistor between each pair of adjacent ports.
This level of isolation is sufficient for maintaining PA stability under amplitude
and phase mismatch caused by process variation. The size of the power combiner
is 1.53x0.7 mm2.
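
A back-of-the-envelope power budget shows how the combiner insertion loss enters the output power. The sketch below assumes ideal in-phase combining of identical unit PAs followed by a single lumped insertion loss; this simplification is ours, and the function names are hypothetical.

```python
def combined_power_w(n_units, unit_power_w, insertion_loss_db):
    """Ideal in-phase N-way combining followed by a lumped insertion loss."""
    return n_units * unit_power_w * 10 ** (-insertion_loss_db / 10)


def required_unit_power_w(total_power_w, n_units, insertion_loss_db):
    """Power each unit PA must deliver to reach total_power_w at the output."""
    return total_power_w / n_units * 10 ** (insertion_loss_db / 10)


# With the reported 16 unit PAs, 0.7 W output and 0.5 dB of combiner loss,
# each unit PA would need to deliver roughly 49 mW under these assumptions.
unit_w = required_unit_power_w(0.7, 16, 0.5)
```

The same functions show the cost of a lossier combiner: at 1.5 dB of insertion loss the required unit power rises to about 62 mW for the same 0.7 W output.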

Using this combiner, a fully integrated power amplifier was designed in a
0.13µm SiGe BiCMOS technology. Figure 8.4.1 shows the block diagram of the
PA and also illustrates the layout placement of the power combiner, unit PAs,
and the 2 input power dividers. By orienting the 16 unit PAs into 2 columns of 8
PAs facing the center, the area and insertion loss of the power combiner can be
minimized. The PA is also configured so that the two columns of unit PAs can
receive different input signal phases, allowing the outphasing technique to be
employed for output power back-off. The unit PA design is similar to that reported
in [7]. As shown schematically in Fig. 8.4.2, each unit PA consists of two cascode
driver stages followed by a common-emitter output stage. SiGe HBT
devices with 200GHz cutoff frequency (ft) are used. To maximize gain and output
power, all of the devices are laid out in a C-B-E-B-C configuration and biased
near the peak-ft current density. A "tapered" metal stack profile is used in the transistor
fingers to reduce side-wall parasitic capacitances between terminals without
sacrificing the current-handling capability. To ensure that the PA is unconditionally
stable, small resistors are added in series with the base bias feeds to suppress
potential resonances. The output stage of the PA operates at a 2.4V supply,
whereas the two cascode driver stages use a 4V supply. The transistors
operate safely under these supply voltages due to a low base bias impedance of
approximately 300Ω [5,7]. A total of 91 pF of decoupling capacitance per unit PA
is spread across the supply feed lines to minimize trace inductance and further
improve PA stability. L-C-based inter-stage matching is used, which consists of
metal-insulator-metal capacitors and T-line inductors. The T-line inductors allow
a compact unit PA layout, and their quality factor exceeds 20 at 45GHz. Spiral
inductors with patterned ground shields are used as RF choke inductors in the
bias feeds. The power divider and combiner are both designed using a conductor-backed
coplanar waveguide structure (Fig. 8.4.2). All of the inductors and
transmission lines are implemented on the 4µm thick top aluminum layer to
achieve high quality factor.
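
The outphasing back-off mentioned above can be illustrated with an idealized vector sum: two equal-amplitude signals at phases +theta and -theta add to an amplitude proportional to cos(theta). The sketch assumes a lossless, ideal summation and does not model the load interaction that occurs in a real non-isolating combiner; the function names are hypothetical.

```python
import cmath
import math


def outphased_sum(amplitude, theta):
    """Sum of two equal-amplitude phasors at +theta and -theta (radians)."""
    return amplitude * cmath.exp(1j * theta) + amplitude * cmath.exp(-1j * theta)


def relative_power_db(theta):
    """Output power relative to in-phase combining (theta = 0), ideal case."""
    return 20 * math.log10(abs(math.cos(theta)))
```

In this idealized picture an outphasing angle of 60 degrees halves the output amplitude, which corresponds to about 6 dB of power back-off.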

The PA was tested on a probe station with coaxial RF probes and a pair of 12-needle
DC probes for supply and bias. Figure 8.4.3 shows the small-signal performance
of the PA. The measured peak gain (S21) is 18.5dB and centered at
43GHz, which is slightly shifted in frequency from simulation most likely due to
inaccuracies in parasitics extraction. Input return loss (|S11|) is better than 10dB
from 42.5 to 49GHz. The measured stability factor k is greater than 1 across the
entire frequency range, indicating unconditional stability. The large-signal performance
of the PA was measured using 2 frequency quadruplers followed by
highpass filters as signal sources. Power losses of all external components were
de-embedded across the measurement frequency band, and the output power of
the PA was measured with a power meter. Figure 8.4.4(a) shows the large-signal
characteristics of the PA at 42GHz versus the output stage supply voltage
while the driver stage supply voltage remains at 4V. At 2.4V, a saturated output
power of 28.4dBm (0.7W) is achieved with 10% PAE. Although output power
increased to 28.7dBm with a 2.7V supply, it degraded slightly over time, which
could indicate over-stress on the HBT devices. No such degradation was
observed with a 2.4V supply. A large-signal frequency sweep is shown in Fig.
8.4.4(b). The output power -3dB bandwidth of the PA is greater than 9GHz,
which corresponds to a 21% fractional bandwidth.
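
The unit conversions behind these figures are easy to reproduce. The helper functions below are a minimal sketch with hypothetical names; the fractional bandwidth is computed against an assumed 42GHz center frequency, and the DC power helper simply rearranges the definition PAE = (Pout - Pin) / Pdc.

```python
import math


def dbm_to_watts(p_dbm):
    """Convert power in dBm to watts."""
    return 10 ** (p_dbm / 10) / 1000


def watts_to_dbm(p_w):
    """Convert power in watts to dBm."""
    return 10 * math.log10(p_w * 1000)


def fractional_bandwidth(bandwidth_hz, center_hz):
    """Bandwidth as a fraction of the center frequency."""
    return bandwidth_hz / center_hz


def dc_power_from_pae(p_out_w, p_in_w, pae):
    """Solve PAE = (Pout - Pin) / Pdc for the DC power Pdc."""
    return (p_out_w - p_in_w) / pae
```

For example, `dbm_to_watts(28.4)` gives about 0.69 W, and `fractional_bandwidth(9e9, 42e9)` gives about 0.214, consistent with the rounded values quoted in the text.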

The  saturated  output  power  of  the  proposed  PA  is  compared  with  recently 
reported  mm-Wave  PAs  in  silicon-based  processes,  as  shown  in  Fig.  8.4.5.  The 
achieved  output  power  of  0.7W  is  2.6  times  the  output  power  of  state-of-the-art 
mm-Wave  PAs  (>  30GHz).  A  comparison  of  PA  metrics  from  recently  reported 
high-power  mm-Wave  PAs  in  silicon  is  summarized  in  Fig.  8.4.6.  This  work 
shows  a  PAE  and  gain  comparable  to  the  referenced  works  as  well  as  one  of  the 
highest  bandwidths  for  a  non-distributed  topology.  Figure  8.4.7  shows  the 
micrograph  of  the  PA  die,  which  occupies  an  area  of  3x1.85  mm2. 
Figure 8.4.1: Block diagram of the 16-way power-combined PA.
Figure 8.4.2: Schematic of the unit PA cell. 3-D views of on-chip passive
devices  are  also  shown. 



Figure  8.4.3:  Simulated  and  measured  S-parameters. 
Figure 8.4.4: (a) Saturated output power and PAE versus supply voltage of the
output  stage  (VCC1)  at  42GHz.  (b)  Saturated  output  power  and  PAE  versus 
frequency. 
Figure 8.4.5: Saturated output power versus frequency of recent mm-Wave
silicon  PAs. 


Figure  8.4.6:  Performance  comparison  of  mm-Wave  silicon  PAs  with  greater 
than  20dBm  output  power. 
Figure 8.4.7: Die micrograph. The PA occupies 5.55mm2 of die area.
A W-band 21.1 dBm Power Amplifier with an
8-way  Zero-degree  Combiner  in  45  nm  SOI  CMOS 
Abstract — This paper presents a W-band power amplifier (PA)
in  45  nm  SOI  CMOS.  The  PA  incorporates  an  8-way  zero-degree 
combiner  to  efficiently  combine  8  parallel  PA  units,  each  of  which 
is  a  2-stage  cascode  PA.  At  80  GHz,  the  PA  achieves  a  saturated 
output  power  (Psat)  of  21.1  dBm,  10.1  dB  peak  gain,  5.2%  peak 
PAE,  and  12  GHz  3-dB  bandwidth,  and  it  consumes  1  mm2  of 
die  area.  The  Psat  of  21.1  dBm  is  the  highest  among  reported 
W-band  PAs  in  CMOS  technology. 

Index  Terms — Power  amplifiers,  power  combining,  millimeter- 
wave,  W-band. 


I.  Introduction 

MILLIMETER-WAVE power amplifiers (PAs) are key
components in applications such as high data-rate communications,
imaging, and radars. Thanks to the continued
scaling of CMOS processes and circuit technique improvements,
CMOS PAs with good performance began to emerge
in the W-band (75-110 GHz) [1]-[6] where compound semiconductor
PAs previously dominated, offering an increased
level of integration and reduced cost. Nonetheless, obtaining
high output power in the W-band still remains challenging due
to the low breakdown voltage of the silicon devices and high
passive loss at these frequencies. To date, the highest output
power achieved by a W-band CMOS PA has yet to surpass 20
dBm [3].

In  this  paper,  we  present  a  W-band  CMOS  PA  that  achieves 
21.1  dBm  saturated  output  power  (Psat)  at  80  GHz  with 
5.2%  peak  power-added  efficiency  (PAE).  The  high  Psat 
is attributed to the use of an 8-way zero-degree combiner
(ZDC)  that  combines  the  powers  of  eight  2-stage  cascode 
unit  PAs.  The  coplanar  waveguide  based  combiner  performs 
the  required  impedance  transformation  and  introduces  low 
insertion  loss,  both  of  which  are  essential  to  achieving  high 
output  power.  Fabricated  in  a  45  nm  silicon-on-insulator  (SOI) 
CMOS  process,  the  PA  demonstrated  a  peak  gain  of  10.1  dB 
and  a  3-dB  bandwidth  of  12  GHz,  while  occupying  a  die  area 
of  1  mm2. 

II.  Zero-Degree  Combiner  (ZDC)  Design 

To  achieve  high  output  power,  a  sufficiently  large  voltage 
or  current  swing  has  to  be  generated  at  the  drain  of  the  power 
device.  Due  to  the  low  breakdown  voltage  of  silicon  processes, 
a  large  voltage  swing  at  the  drain  is  typically  undesirable.  On 
the other hand, the current swing is limited by the channel
width of the device, which cannot be arbitrarily increased
due to the parasitics that deteriorate gain and matching at
high frequencies. A small load impedance is also required to
increase current swing, which usually needs a lossy impedance
transformer to transform the 50 Ω load down to the small
impedance.

Power combining is a technique that alleviates the aforementioned
trade-offs. By combining N unit PAs using an
N-way power combiner, the required current swing at each
unit PA is reduced by a factor of N, as is the impedance
transformation ratio needed at the output of each unit PA.
In particular, a ZDC is especially suited to combine a large
number of unit PAs (i.e. N > 4), which has been previously
demonstrated in the Q- and V-band [7], [8].
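
The factor-of-N relaxation can be seen from a simple load-line estimate, in which a device with peak voltage swing Vp delivers P = Vp^2 / (2 R) into a resistive load R. The sketch below is only that textbook estimate, not the load-pull procedure used in this work, and the 1 V swing and 100 mW total power are hypothetical values chosen to keep the arithmetic simple.

```python
def optimum_load_ohm(peak_voltage_swing_v, output_power_w):
    """Load-line estimate: R = Vp^2 / (2 P) for a given swing and power."""
    return peak_voltage_swing_v ** 2 / (2 * output_power_w)


def unit_load_and_ratio(peak_v, total_power_w, n_units, system_ohm=50.0):
    """Load seen by each of n_units PAs and its ratio to the system impedance."""
    r_unit = optimum_load_ohm(peak_v, total_power_w / n_units)
    return r_unit, system_ohm / r_unit


# Hypothetical numbers: 1 V peak swing, 100 mW total output power.
single = unit_load_and_ratio(1.0, 0.1, 1)     # one PA supplies everything
eight_way = unit_load_and_ratio(1.0, 0.1, 8)  # power shared by 8 unit PAs
```

With these numbers a single PA needs a 5 Ω load (a 10:1 transformation from 50 Ω), whereas each of eight unit PAs needs 40 Ω (1.25:1), which is the factor-of-N reduction described above.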

In this work, a W-band 8-way ZDC is designed and used
in the PA architecture shown in Fig. 1. The 8-way combiner
consists of 3 levels of transmission line (t-line) segments (i.e.
TL1, TL2, and TL3). By carefully selecting the characteristic
impedance and length of each level of t-lines, a wide range
of input impedances can be achieved. Unlike a traditional
Wilkinson combiner, which uses isolation resistors and quarter-wavelength
t-lines to achieve port isolation, the ZDC operates
with the assumption that all input signals are in-phase; hence
ideal port isolation is not required and t-line lengths much
shorter than a quarter-wavelength can be used, minimizing
insertion loss.

Figure 2 shows a simplified layout of the 8-way ZDC in this
work. The ZDC consists of coplanar-waveguide segments that
reside on a 2.2 µm thick top aluminum layer (the ground plane
is shown as light-gray in Fig. 2). Optimization of the ZDC
involves achieving the optimum input impedance with a given
bandwidth while minimizing insertion loss. The optimum input
impedance of the combiner (i.e. load impedance of the unit
PA cell) is obtained from load pull simulations of the PA. The
pitch of the unit PA slice, P, sets the minimum length of
each t-line segment, such that L_TL1 >= P/2, L_TL2 >= P,
etc. Since the insertion loss of the combiner has a strong
correlation to the total electrical length of the combiner,
minimizing P helps reduce combiner loss and improves overall
efficiency. With a pitch of 150 µm, the designed ZDC has
a total electrical length of 0.27 λ (510 µm), leading to an
overall insertion loss of 0.53 dB while presenting each unit PA
with an impedance Zopt = 10 + j15 Ω. Since the impedance
transformation is integral to the ZDC, additional impedance
transformation and its associated insertion loss are avoided.
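
The quoted electrical and physical lengths can be cross-checked against each other. The sketch below derives the guided wavelength from the two numbers and, assuming the 0.27 λ figure refers to 80 GHz (our assumption; the text does not state the reference frequency), the implied effective permittivity of the line. Function names are hypothetical.

```python
C0 = 299_792_458.0  # speed of light in vacuum, m/s


def guided_wavelength_m(physical_length_m, electrical_length_wavelengths):
    """Guided wavelength implied by a physical and an electrical length."""
    return physical_length_m / electrical_length_wavelengths


def effective_permittivity(freq_hz, guided_wavelength):
    """Effective relative permittivity from eps_eff = (c / (f * lambda_g))^2."""
    return (C0 / (freq_hz * guided_wavelength)) ** 2


def loss_per_mm_db(insertion_loss_db, physical_length_m):
    """Average insertion loss per millimeter of line."""
    return insertion_loss_db / (physical_length_m * 1e3)


lam_g = guided_wavelength_m(510e-6, 0.27)        # about 1.89 mm
eps_eff = effective_permittivity(80e9, lam_g)    # close to 3.9
loss_mm = loss_per_mm_db(0.53, 510e-6)           # about 1.04 dB/mm
```

An effective permittivity near 3.9 is in the range expected for a line embedded in an oxide back-end, which suggests the two quoted lengths are mutually consistent under the 80 GHz assumption.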

As shown in Fig. 1, each four-way PA branch is fed by a
4-way zero-degree divider. The power divider shares the same
design methodology as the combiner. The input to each unit
PA is matched to each of the 50 Ω input ports by properly
sizing the t-lines in the divider.

III.  Power  Amplifier  Design 

Figure 3 shows the schematic of the unit PA cell. A 2-stage
cascode topology is adopted, where the thin gate floating-body
NFETs M1/M2 and M3/M4 are sized at 250 µm and 120 µm,
respectively. The device sizes of the two stages are selected
such that the output stage saturates prior to the driver stage,
ensuring that Psat is not limited by the driver. The gate of
each of the cascode devices (i.e. M2 and M4) is shunted with
a properly sized capacitor, such that some of the drain voltage
swing couples to the gate node through the capacitive
divider. The peak drain-gate voltage of the cascode device is
reduced as a result, improving reliability.
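
The effect of the gate capacitor can be sketched with a two-capacitor divider: the drain swing couples to the gate through the gate-drain capacitance and is divided against the shunt capacitor. The model below ignores the gate-source capacitance and all other coupling paths, and the capacitance and swing values are hypothetical, so it only illustrates the trend.

```python
def gate_swing_v(drain_swing_v, c_gd_f, c_gate_shunt_f):
    """Gate swing from a capacitive divider between C_gd and the shunt cap."""
    return drain_swing_v * c_gd_f / (c_gd_f + c_gate_shunt_f)


def drain_gate_swing_v(drain_swing_v, c_gd_f, c_gate_shunt_f):
    """Remaining drain-to-gate swing once the gate is allowed to move."""
    return drain_swing_v - gate_swing_v(drain_swing_v, c_gd_f, c_gate_shunt_f)


# Hypothetical values: 2 V drain swing, C_gd = 50 fF, shunt capacitor = 150 fF.
v_gate = gate_swing_v(2.0, 50e-15, 150e-15)
v_dg = drain_gate_swing_v(2.0, 50e-15, 150e-15)
```

Here a quarter of the drain swing appears at the gate, so the drain-gate swing drops from 2 V to 1.5 V; a smaller shunt capacitor lets the gate follow the drain more closely.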


Fig.  4.  Die  micrograph  of  the  PA. 


The  two  devices  in  each  cascode  stage  are  laid  out  as  a 
single row of gate fingers, as opposed to having the two devices
laid out separately. In addition, the side-wall capacitance
between  the  source  and  drain  metal  stacks  is  also  minimized 
by  “tapering”  the  cross  sections  of  the  stacks.  These  layout 
practices  minimize  the  parasitics  of  the  devices,  improving 
both  gain  and  efficiency  of  the  PA.  Similar  to  the  combiner, 
coplanar  waveguide  t-lines  are  used  to  implement  the  RF 
chokes  and  inductive  matching  elements.  Capacitors  in  the 
matching  networks  are  implemented  as  inter-digitated  metal 
sidewall  capacitors. 

Figure  4  shows  the  die  micrograph  of  the  PA.  The  PA 
measures  0.69  mm  by  1.41  mm  (1  mm2),  including  pads.  A 
continuous  ground  plane  on  the  top  aluminum  layer  provides  a 
low  inductance  current  path  from  the  PA  devices  to  the  external 
ground.  Note  that  the  ground  plane  near  the  power  combiner  is 
slotted  in  order  to  conform  with  foundry  metal  spacing  rules. 
The  two  input  power  dividers  are  positioned  below  the  top 
metal  ground  plane,  thus  they  are  not  visible  from  the  die 
micrograph.  As  mentioned  in  the  previous  section,  the  layout 
pitch  of  the  unit  PA  slices  should  be  minimized  in  order  to 
achieve  a  more  compact  and  efficient  combiner.  T-line  RF 
chokes  for  both  the  driver  and  output  stage  are  intentionally 
made  short  to  reduce  pitch. 

IV.  Measurement  Results 

The  W-band  PA  was  measured  with  coaxial  probes  on  a 
probe  station.  As  shown  in  Fig.  5,  the  two  input  ports  were 
each  fed  by  a  6x  frequency  multiplier  module,  which  is  capable 
of  delivering  up  to  12  dBm  power  in  the  W-band.  The  output 
power  of  the  PA  was  measured  with  a  waveguide  power  sensor. 
The  relative  phase  of  the  two  signal  sources  is  adjusted  so  that 
the  PA  operates  at  peak  output  power.  Losses  of  all  external 
components  were  calibrated  out  across  frequency. 


TABLE I

Performance comparison of recent W-band PAs in silicon.

Reference  Technology     Freq.  Psat   PAE   Supply  Gain  Area   Power Combining
                          (GHz)  (dBm)  (%)   (V)     (dB)  (mm2)
This Work  45nm SOI CMOS  80     21.1   5.2   2       10.1  1      8-Way Zero-Degree
[3]        65nm CMOS      79     19.3   19.2  1       24.2  1.23   8-Way Transformer
[4]        45nm SOI CMOS  89     17     9     4.2     8     0.26   3-Stacked
[5]        45nm SOI CMOS  90     19     8.9   6.8     36    2.2    6-stack DAC
[6]        65nm CMOS      90     18.3   9.5   1.2     12.5  0.82   16-way combined



Fig.  7.  Measured  Pout  and  PAE  versus  supply  voltage  at  80GHz. 


Figure  6  shows  the  measured  saturated  output  power  and 
PAE  over  the  frequency  range  from  67  GHz  to  90  GHz.  Peak 
performance  was  achieved  at  80  GHz,  with  an  output  power  of 
21.1  dBm  and  5.2%  PAE.  The  output  power  3-dB  bandwidth 
was  12  GHz.  As  shown  in  Fig.  7,  the  supply  voltage  was 
swept  from  0.9  V  to  2.2  V  and  output  power  varied  from  13.4 
dBm  to  21.4  dBm,  while  PAE  remained  relatively  constant 
over  most  of  the  supply  sweep  range. 

The  performance  of  the  PA  is  summarized  and  compared  to 
recent  W-band  CMOS  PAs,  as  shown  in  Table  I.  The  achieved 
Psat of 21.1 dBm is 1.8 dB (51%) higher than the prior state of
the art.


V.  Conclusion 

This  letter  presents  a  W-band  PA  with  21.1  dBm  Psat  and 
5.2%  peak  PAE  in  45nm  SOI  CMOS.  The  PA  achieves  8- 
way  power  combining  with  a  low-loss,  area-efficient  ZDC  that 
offers  built-in  impedance  transformation.  A  2-stage  cascode 
topology  is  adopted  in  the  unit  PAs,  and  an  overall  gain  of  10.1 
dB  is  demonstrated  at  80  GHz.  To  the  best  of  our  knowledge, 
this  PA  has  the  highest  Psat  among  reported  W-band  CMOS 
PAs. 
Abstract — We present a compact, high-gain power amplifier
(PA) that achieves high power and high output impedance
through a 4-stacked architecture. By proper adjustment of
device sizes and bias voltages, the optimum load impedance
for maximum Psat is moved close to 50 Ohm, which eliminates
the need for an output matching network. The single-stage
PA exhibits a high gain of 16 dB, which is more than twice the
gain of previously reported PAs, and a 19.2 dBm Psat and 19%
PAE, which are comparable to the state-of-the-art CMOS
PAs at Q-band. The very compact area of 0.09 mm2 and the
high gain make this design a suitable unit PA to be used
for further parallel power combining to reach Watt levels of
output power.

I.  Introduction 

Integration of radio-frequency circuits in deeply scaled
CMOS technology is an ever increasing goal of researchers
and the industry. One of the challenges in RF CMOS
implementation is achieving high output power and efficiency
in power amplifiers (PA) despite the low breakdown
voltages of the thin gate oxide. One way to increase the
output power is to increase the drain current by increasing
the number of parallel devices, which in turn reduces
the required optimum load impedance, Zopt, to impractical
values. Zopt must then often be synthesized from 50 Ohm
with an output matching network, which may reduce the
bandwidth of the PA and consume significant die area as
well. Alternatively, one could increase the voltage swing
through the use of multi-stack transistors. In this case the
drain current is kept relatively constant and drain-source
voltages of multiple devices add up across the load. As the
voltage swing increases with more stacked devices, the PA
requires a higher optimum load impedance for delivering
the maximum output power and efficiency.

Two-level cascoded devices have demonstrated state-of-the-art
PAE at mm-waves [1], [2], but the highest output
power  is  reported  for  3-  and  4-stacked  PAs  at  the  expense 
of  reduced  PAE  due  to  the  series  loss  of  the  additional 
transistors. 

In this paper, we present a 4-stacked PA which achieves
a maximum Psat of 19.2 dBm and a peak PAE of 19%.
This performance is comparable to the highest reported
PAs in the same technology and at the same frequency
range, but the PA provides 16 dB of large signal (LS) gain, which
is a factor of two higher than previously reported single-stage
Q-band CMOS PAs, and occupies almost 50% less
die area. The high gain and compact size are advantageous
when this amplifier is used as a unit PA in a higher
level parallel power combiner to achieve Watt level output
power.

Fig. 1. Schematic diagram of 4-stacked PA with qualitative
demonstration of voltages on gate and drain nodes

II.  Power  Amplifier  Design 

Fig.  1  shows  the  schematic  diagram  of  the  presented  PA. 
It  consists  of  4  stacked  NMOS  devices  each  with  a  total 
gate width of 150 µm. The input signal is applied to the gate
of M1 through an input impedance matching network. The
current  is  then  passed  through  M2-M4  to  reach  the  output 
node.  As  shown  qualitatively  in  Fig.  1,  voltage  swings 
are  increased  from  one  drain  node  to  another  to  achieve 
the  higher  peak  output  power.  The  key  difference  between 
this  “series”  voltage  power  combining  approach  and  the 
conventional  cascode  topology  is  that  the  gate  voltages 
of  M2-M4  are  not  kept  at  a  constant  voltage,  but  rather 
are  allowed  to  have  an  ac  component.  This  ensures  that 
the instantaneous gate-drain (Vgd) and gate-source (Vgs)
voltages do not exceed the breakdown limits. The voltage
on  each  gate  is  determined  by  the  capacitive  divider  of 
Cgd  of  each  FET  and  the  Cuas  of  that  gate.  Fig.  2  shows 
the  simulated  drain  waveforms  of  each  FET  as  well  as 
their corresponding Vds. It is clear that while the voltage
across  the  load  can  achieve  very  high  voltages  (~8V)  the 
individual  values  of  Vds  never  exceed  3.1  V. 

By proper choice of device sizes and bias voltages, the
optimum load impedance of the PA is made as close as
possible to 50 Ohm. The output network could therefore be
completely eliminated as shown in Fig. 1. Fig. 3 shows the
simulated load-pull Pout contours for the nominal supply
voltage. In this simulation, the RF choke is treated as part
of the device under test. The load impedance that leads to
peak Pout is around 25 - j25 Ω at the design frequency. The
imaginary component can be conveniently achieved by the
shunt capacitance of the output probe pad. The real part
of Zopt is still less than 50 Ohm, but the P-1dB contours
are large enough to extend to 50 Ohm. Most impedance
matching networks have insertion losses of 1-2 dB at
Q-band, and thus would negate any improvement that
impedance matching to an ideal Zopt could provide.

Matching each individual FET in the 4-stack helps achieve a high gain while
maximizing the output power capability, and hence the power efficiency, of the
PA. This is achieved by using optimally sized transmission lines to resonate
out the parasitic capacitance of the successive FET. These parasitic elements
can contribute significantly to signal loss in a mm-wave PA. The transmission
lines also help physically separate the individual devices and achieve a more
even thermal distribution across the chip.

The circuit was designed and implemented in a 45 nm SOI CMOS process with an
fT > 400 GHz [7]. The transmission lines used in the input and inter-stage
networks, as well as the RF choke, were implemented using a shielded microstrip
structure to eliminate coupling to and from other transmission lines. A custom
scalable lumped circuit model was developed for the transmission lines to
facilitate the design and simulation of the PA. Fig. 4 shows the chip
photograph of the PA; the core of the design fits in a very compact area of
0.09 mm².


III.  Measurement  Results 

The PA was measured using an Agilent 8257D source and an N1913 power meter.
Stable operation of the power amplifier was verified by monitoring the output
signal on an HP8563E spectrum analyzer. Power levels, however, are read from
the power meter for greater accuracy. Cable and probe losses were measured
prior to the test and are accounted for in the presented data. The single-stage
power amplifier provides 16 dB of transducer power gain with a 3 dB bandwidth
extending from 35 GHz to 49 GHz.


Fig. 2. Simulated waveforms (a) at each drain and (b) across each device.

Fig. 3. Simulated power contours in dBm for VDD = 4.0 V.

Fig. 4. Photograph of the PA chip with active area of 0.09 mm².


TABLE I

Comparison to state-of-the-art mm-wave PAs

Ref       | Technology | Architecture    | Freq. (GHz) | BW (GHz) | Supply (V) | Psat (dBm) | Peak PAE (%) | Gain (dB) | Area (mm²)
This Work | 45 nm SOI  | 4-stack         | 42          | 15       | 5          | 19.2       | 19           | 16        | 0.09
[3]       | 45 nm SOI  | 4-stack         | 47.5        | N/A      | 5          | 20.3       | 19.4         | 12.8      | 0.16
[4]       | 45 nm SOI  | 4-stack         | 41          | > 6      | 5          | 21.6       | 25.1         | 8.9       | 0.30
[5]       | 45 nm SOI  | 3x2 stack (DAC) | 5-45        | 25*      | 6.6        | 16.5       | ~8           | N/A       | 0.23
[3]       | 45 nm SOI  | 2-way; 2-stack  | 47.5        | N/A      | 2.9        | 19.1       | 16           | 8.2       | 0.43*
[6]       | 45 nm SOI  | Doherty 2-stack | 42          | N/A      | 2.5        | 18         | 23           | 7         | 0.64
[1]       | 45 nm SOI  | 3-stack         | 45          | > 4      | 2.7        | 18.6-19.4  | 32-33.9      | 9.5       | 0.30

* Estimated from plot or picture in referenced paper.

Fig. 5. Measured Psat and PAE vs. frequency.

Fig. 6. Measured Pout and PAE vs. input power at VDD = 4.4 V and 5 V.

Fig. 5 shows the measured saturated output power and PAE of the amplifier over
frequency. It can be seen that the amplifier indeed provides a constant output
power of 18-19 dBm to a 50 Ω load over the frequency range of interest. Fig. 6
shows the output power and PAE of the PA at 41 GHz. The power consumption of
the chip is calculated from the measured dc current at each applied input
power. With a 5 V supply, the power amplifier provides up to 19.2 dBm of
saturated output power and has a peak PAE of 18.5%. At a supply voltage of
4.4 V, the peak output power is 17.8 dBm and the peak PAE is 19%.
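
As a rough illustration of how a PAE value follows from a power-meter reading
and a measured dc current, the sketch below converts dBm to watts and applies
the standard definition PAE = (Pout - Pin) / Pdc. The function names and the
example operating point (input power and dc current) are hypothetical values
chosen for illustration; they are not measured data from this work.

```python
def dbm_to_watts(p_dbm: float) -> float:
    """Convert a power level in dBm to watts."""
    return 10.0 ** (p_dbm / 10.0) / 1000.0


def pae_percent(pout_dbm: float, pin_dbm: float, vdd: float, idc: float) -> float:
    """Power-added efficiency in percent: (Pout - Pin) / (VDD * IDC)."""
    p_dc = vdd * idc  # dc power drawn from the supply, in watts
    return 100.0 * (dbm_to_watts(pout_dbm) - dbm_to_watts(pin_dbm)) / p_dc


# Hypothetical operating point (illustrative only, not measured data).
example = pae_percent(pout_dbm=19.0, pin_dbm=5.0, vdd=5.0, idc=0.080)
print(f"PAE = {example:.1f} %")
```

Repeating the calculation for each input power step, with the dc current
measured at that step, gives a PAE curve of the kind plotted in Fig. 6.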


IV.  Conclusion 

Table I summarizes the performance of the presented PA in comparison to recent
works in Q-band and 45 nm SOI technology. This work has 2-4 times higher gain
than previously reported PAs. High gain is important because it relaxes the
requirement for pre-driver stages, which can be power- and area-hungry in a
practical system. The presented PA will therefore result in an overall higher
PAE of the transmitter. The area is almost 50% smaller than the next smallest
PA, with a comparable Psat and PAE.

Abstract: This paper presents an energy-efficient, high-throughput and
high-precision signal component separator (SCS) chip design for the
asymmetric-multilevel-outphasing (AMO) power amplifier. It uses a fixed-point
piece-wise linear functional approximation developed to improve the hardware
efficiency of the outphasing signal processing functions. The chip is
fabricated in a 45 nm SOI CMOS process and the SCS occupies an active area of
1.5 mm². The new algorithm enables the SCS to run at a throughput of
3.4 GSamples/s, producing the phases with 12-bit accuracy. Compared to
traditional low-throughput AMO SCS implementations, at 0.8 GSamples/s this
design improves the area efficiency by 25× and the energy efficiency by 2×.
This fastest high-precision SCS to date enables a new class of high-throughput
mm-wave and base station transmitters that can operate at high area, energy and
spectral efficiency.

Index Terms: Application specific integrated circuits (ASIC), asymmetric
multi-level outphasing (AMO) power amplifier, baseband, energy efficiency,
linear amplification by nonlinear component (LINC), signal component separator
(SCS), throughput.


I.  Introduction 

High-throughput wireless communication systems working in the millimeter-wave
(mm-wave) frequency range from 60 GHz to 90 GHz [1]-[7] have recently become
the focus of research and development activity. The availability of large
chunks of bandwidth and the maturity of CMOS process technology provide the
opportunity to address several large markets with bandwidth-demanding
communication applications. Meanwhile, these mm-wave applications pose great
challenges for transceiver design, due to factors such as power-amplifier (PA)
efficiency and linearity, high wireless channel loss and multipath, increasing
parasitics for passive components, and limited amplifier gain. Even in cellular
base stations, the drive toward flexible, multi-standard radio chips increases
the need for high-precision, high-throughput and energy-efficient backend
processing. The desire to best leverage the available spectrum for these
high-throughput applications creates the demand for high-efficiency and
high-linearity PAs. While these conflicting PA design requirements have been
satisfied in the past at low system throughputs by designing smart digital
back-ends, the multi-GSamples/s throughput required in new applications poses a
significant challenge for digital baseband system design, which must perform
the necessary modulation and predistortion operations at negligible power
overhead.

This desire for a high-throughput, energy-efficient digital baseband becomes
especially prominent for outphasing PAs, which are designed to improve
efficiency while satisfying the high-linearity requirements of higher-order
signal constellations. At low throughputs (10-100 MSamples/s), outphasing PAs
rely on complex digital signal processing to generate the outphasing vectors,
which makes it possible to use simple, high-efficiency switching PAs on each
path. Examples of outphasing PAs include the
linear-amplification-by-nonlinear-component (LINC) PA proposed by Cox in [8],
and its more recent modification, the asymmetric-multilevel-outphasing (AMO) PA
[9]-[11]. At high (multi-GSamples/s) throughputs, however, a radical redesign
of the signal component separator (SCS) digital signal processing
implementation is needed to prevent degradation of the net power efficiency
caused by a significant increase in digital baseband power consumption.

The conventional LINC SCS has traditionally been implemented in both analog and
digital designs [12]-[14]. The analog versions of the SCS are not suitable for
high-speed, high-precision applications, so we only consider digital SCS
implementations. The SCS decomposes the original sample signal into two signals
as required by LINC/AMO, and the decomposition involves the computation of
several nonlinear functions. For a digitally implemented SCS, a look-up table
(LUT) is the most common way to realize the nonlinear functions. Considering
that past signal separators mainly work below 100 MSamples/s with low to
medium precision, the LUT is indeed the simplest and most energy-efficient
approach. Even for the recent AMO architecture, the LUT is still the preferable
choice for operation under 100 MSamples/s [15]. However, the traditional
LUT-based function map quickly becomes infeasible when the throughput and
precision requirements go up to the multi-GSamples/s and more-than-10-bit
range. The LUT size becomes prohibitively large for on-chip implementation and
incurs a penalty in both area and speed. Besides, the number of LUTs used in
the AMO SCS is significantly larger than in the LINC SCS, so LUT solutions that
barely work for LINC render AMO implementations infeasible. On the other hand,
at these high throughputs a direct nonlinear function synthesis through
iterative algorithms like CORDIC [16] or
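TABLE I

LINC and AMO SCS Equations

LINC equations:

$A = \sqrt{I^2 + Q^2}$, $\theta = \arctan(Q/I)$  (linc1)

$\alpha = \arccos\left(\frac{A}{2a}\right)$  (linc2)

$\varphi_1 = \theta + \alpha$, $\varphi_2 = \theta - \alpha$  (linc3)

AMO equations:

$A = \sqrt{I^2 + Q^2}$, $\theta = \arctan(Q/I)$  (amo1)

$\alpha_1 = \arccos\left(\frac{A^2 + a_1^2 - a_2^2}{2 A a_1}\right)$, $\alpha_2 = \arccos\left(\frac{A^2 + a_2^2 - a_1^2}{2 A a_2}\right)$  (amo2)

$\varphi_1 = \theta + \alpha_1$, $\varphi_2 = \theta - \alpha_2$  (amo3)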

A*)  -  i-hA*)  (amo4) 

nonlinear filters [17] proves to be more area-compact, but its power footprint
is prohibitive for the overall power efficiency of the PA.

In this paper, we present the function synthesis algorithms and a
corresponding chip implementation, designed using an alternative approach to
compute the nonlinear functions, which is both more area-efficient and more
energy-efficient than state-of-the-art methods like LUTs, CORDIC or nonlinear
filters. The chip results demonstrate an AMO SCS working at 3.4 GSamples/s with
12-bit accuracy and over 2× energy savings and 25× area savings compared to a
traditional AMO SCS implementation. The new approach is based on the piece-wise
linear (PWL) approximation of a nonlinear function. The approximation consists
of a LUT access, an add, and a multiply. In order to minimize the computational
cost while maintaining high accuracy and throughput, we propose a novel
algorithm to find the fixed-point representation of the approximation. The idea
of the fixed-point version of the approximation is to use as few operations as
possible and to minimize the number of input bits to all the operations so as
to achieve high throughput. With these considerations, we are able to achieve a
fixed-point representation of typical LINC or AMO nonlinear functions that
consists of one small LUT, one adder and one multiplier. The hardware
architecture derived from this algorithm achieves a good balance among area,
energy efficiency, throughput and computation accuracy, which is presented in
detail in the rest of the paper.

The paper is organized as follows. In Section II, we present the basic
principles of the LINC and AMO PA architectures and their corresponding SCSs.
In Section III, we introduce the proposed approximation algorithm and an
example to illustrate its derivation and advantages. In Section IV, we present
the chip design of the digital baseband system and the microarchitecture of
each block, followed by the chip measurement results. We conclude the paper in
Section V.

II.  System  Overview 

Both LINC and AMO PAs are outphasing PA architectures, and their digital
basebands perform similar computations. The LINC PA architecture was proposed
by Cox in [8] with the motivation of relieving the ever-present trade-off
between the power efficiency and the linearity of the PA. By decomposing the
transmitted signal into two constant-amplitude signals, high-efficiency PAs can
be used to amplify the two decomposed signals without sacrificing linearity.
The AMO PA architecture, proposed in [9]-[11], improves the average power
efficiency further by allowing the two PAs to switch among a discrete set of
power supplies rather than fixing them at a single supply level.


Fig. 1. (a) LINC, AMO SCS. (b) AMO PA system overview.


Fig. 1(a) shows the working schemes of the LINC SCS and the AMO SCS for an
arbitrary IQ sample (I, Q). The SCS decomposes (I, Q) into two signals with
phases $\varphi_1$, $\varphi_2$ and amplitudes $a_1$, $a_2$, where for LINC
$a_1 = a_2 = a$. The outphasing angles $\varphi_1$ and $\varphi_2$ for both
architectures are derived from the equations summarized in Table I. In the AMO
equations, $a_1$, $a_2$ denote the power supplies of the two PAs respectively.
$a_1$, $a_2$ are restricted to the set $V = \{V_1, V_2, V_3, V_4\}$, where
$V_1 < V_2 < V_3 < V_4$ are the four levels of supply voltages. The equations
in (amo4) of Table I enter the signal decomposition process simply because of
an architectural requirement of the digital-to-RF-phase converter (DRFPC) [18],
which converts the digital outputs to RF modulated signals and takes a function
of the phase, $f(\varphi)$, as its input. In general, the computations in
(amo4) depend on the type of modulator and may differ from what we present
here.
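
The decomposition can be checked numerically. The sketch below is a
floating-point illustration, not the chip's fixed-point implementation: it
assumes the standard outphasing geometry (law of cosines on the triangle with
sides $A$, $a_1$, $a_2$) for the AMO angles, uses a four-quadrant arctangent,
and leaves out the supply-level selection and the (amo4) step. All function
names and the sample values are hypothetical.

```python
import cmath
import math


def linc_scs(i: float, q: float, a: float):
    """LINC: two phases for branches of equal amplitude a."""
    amp = math.hypot(i, q)               # A = sqrt(I^2 + Q^2)
    theta = math.atan2(q, i)             # four-quadrant arctan(Q / I)
    alpha = math.acos(amp / (2.0 * a))   # outphasing angle
    return theta + alpha, theta - alpha


def amo_scs(i: float, q: float, a1: float, a2: float):
    """AMO: two phases for branches of amplitude a1 and a2."""
    amp = math.hypot(i, q)
    theta = math.atan2(q, i)
    # Law of cosines on the triangle with sides A, a1 and a2.
    alpha1 = math.acos((amp**2 + a1**2 - a2**2) / (2.0 * amp * a1))
    alpha2 = math.acos((amp**2 + a2**2 - a1**2) / (2.0 * amp * a2))
    return theta + alpha1, theta - alpha2


def recombine(a1: float, phi1: float, a2: float, phi2: float) -> complex:
    """Vector sum of the two outphased branches."""
    return a1 * cmath.exp(1j * phi1) + a2 * cmath.exp(1j * phi2)


sample = complex(0.6, -0.3)              # hypothetical IQ sample
p1, p2 = linc_scs(sample.real, sample.imag, a=1.0)
linc_sum = recombine(1.0, p1, 1.0, p2)
p1, p2 = amo_scs(sample.real, sample.imag, a1=0.75, a2=0.5)
amo_sum = recombine(0.75, p1, 0.5, p2)
print(abs(linc_sum - sample), abs(amo_sum - sample))
```

Both sums reproduce the original sample to within floating-point rounding,
provided the amplitudes satisfy $|a_1 - a_2| \le A \le a_1 + a_2$ so that the
arccosine arguments stay within [-1, 1].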

The typical low-throughput LINC SCS and recent AMO implementations [12]-[15],
[19] usually involve the use of the coordinate rotation digital computer
(CORDIC) [16] and a LUT map for the nonlinear functions in Table I [14], [20].
The maturity of the CORDIC algorithm and the simplicity of the LUT approach
make them suitable for LINC SCS applications whose throughput is below
100 MSamples/s and whose resolution is low to medium (< 8 bits, for example).
However, these approaches become less attractive or even prohibitive for our
target mm-wave wideband applications, where the throughput is in the
multi-GSamples/s range with high phase resolution (> 10 bits, for example). In
the next section, we show our proposed solution: using fixed-point PWL
approximations of the nonlinear functions, which provides a balance among
accuracy, power and area.


III.  Proposed  Piece-Wise  Linear  Approximation 
A.  Algorithm 

The motivation for a new approach to the nonlinear function computation is
simple: avoid complex computations and replace them with simple and
energy-efficient ones. For example, table look-up with LUTs of reasonable
size, adders and multipliers are the favorable computations to perform. We also
observe that all functions involved in the SCS computations are smooth over
almost the whole input range. Hence, they are suitable for approximation by
functions built from simply structured basis functions, such as polynomials and
splines. These considerations lead us to the PWL approximation of the
nonlinear functions.

Fig. 2(a) shows the general application of the PWL approximation to any smooth
nonlinear function. The input range of $x$ is divided into several intervals,
and a linear function $y_i = a_i x + c_i$, $x \in [x_i, x_{i+1})$, is
constructed in each interval to approximate the actual function value in that
range. With this approximation, the computation of the nonlinear function only
consists of the linear function computation in each interval (add and
multiply), plus a relatively small LUT for the linear function parameters
$a_i$, $c_i$ of each interval. In terms of accuracy, for any function which has
a continuous second-order derivative, the approximation error is bounded by the
interval length and the second-order derivative and does not depend on
higher-order derivatives, as shown in [21]:

$$|\mathrm{error}| \le \frac{1}{8}\,(x_{i+1} - x_i)^2 \max_{x_i \le x \le x_{i+1}} |y''(x)|. \qquad (1)$$

Here, $x_i$, $x_{i+1}$ are the boundaries of the $i$th interval and $y''$ is
the second-order derivative with respect to $x$. We observe that the
approximation error can be made arbitrarily small as we increase the number of
approximation intervals. These initial examinations of the computational
complexity and approximation accuracy of the piece-wise linear approximation
make it an appealing alternative technique for the LINC and AMO SCS designs.
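
The bound in (1) is easy to verify numerically. The following sketch assumes
that the linear piece in each interval is the chord through the interval
endpoints (the classical setting for the 1/8 bound) and uses $y = \sin(x)$ on
[0, 1] as a hypothetical stand-in for an SCS function; the function names are
illustrative.

```python
import math


def pwl_chord_error(f, lo: float, hi: float, n_intervals: int, probes: int = 64):
    """Largest |f(x) - chord(x)| over all intervals, sampled at probe points."""
    h = (hi - lo) / n_intervals
    worst = 0.0
    for i in range(n_intervals):
        x0 = lo + i * h
        slope = (f(x0 + h) - f(x0)) / h          # chord through the endpoints
        for p in range(probes + 1):
            x = x0 + h * p / probes
            worst = max(worst, abs(f(x) - (f(x0) + slope * (x - x0))))
    return worst


def error_bound(h: float, max_abs_second_derivative: float) -> float:
    """Right-hand side of (1) for interval length h."""
    return h * h * max_abs_second_derivative / 8.0


# For y = sin(x) on [0, 1], |y''| = |sin(x)| <= sin(1).
results = []
for n in (8, 16, 32, 64):
    measured = pwl_chord_error(math.sin, 0.0, 1.0, n)
    bound = error_bound(1.0 / n, math.sin(1.0))
    results.append((n, measured, bound))
    print(f"{n:3d} intervals: error {measured:.3e}, bound {bound:.3e}")
```

Doubling the number of intervals reduces the error by roughly a factor of four,
which is the quadratic dependence on interval length that (1) predicts.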

In order to benefit from the nice properties of the PWL approximation, we need
to tailor it to be hardware-implementation friendly. Most importantly, all the
arithmetic computations have to be converted to their fixed-point counterparts,
and the question is whether the resulting fixed-point computations are able to
operate at multi-GSamples/s throughputs with high accuracy. The most obvious
solution is a direct quantization of the parameters in the floating-point
representation of the approximation formula. However, this may not be an
optimal solution if throughput is the major concern and bottleneck, because the
operands of the add and multiply, $a_i$ and $c_i$, are quantized to the same
long word length as the output, and this long-word arithmetic is likely to be
in the critical timing path. Further optimization of the long multiplication
would only add complexity to the design. In what follows, we present a modified
formulation of the fixed-point PWL approximation and show its capability of
running at a much higher throughput than the direct quantization version of the
approximation.

Fig. 2. (a) The general concept of PWL approximation. (b) Proposed fixed-point
PWL approximation, $y_i = k_i (x_2 - S_i) + b_i$.

The setup of our problem is to compute a nonlinear function with $m$-bit output
from an $m$-bit input $x \in [0, 1)$, using the PWL approximation. An $m$-bit
input $x$ can be decomposed into $x_1$ and $x_2$ as

$$x = [\,\underbrace{x_1}_{m_1\text{-MSB bits}},\ \underbrace{x_2}_{m_2\text{-LSB bits}}\,], \quad m = m_1 + m_2.$$

Naturally, $x_1$ divides the input range into $2^{m_1}$ intervals and is the
index of those intervals. Fig. 2(b) shows an enlargement of the $i$th interval
of the approximation, where $x_1$ takes its $i$th value and $x_2$ takes
$2^{m_2}$ values, ranging from 0 to $2^{m_2} - 1$. Under this setup, our
proposed fixed-point scheme is shown in (2).

$$\mathbf{y}_i = \underbrace{b_i \cdot \mathbf{1}}_{m_1\text{-MSB bits}} + \underbrace{k_i(\mathbf{x}_2 - S_i \cdot \mathbf{1})}_{m_2\text{-LSB bits}}, \quad i = 0, 1, \ldots, 2^{m_1} - 1. \qquad (2)$$

Here, $\mathbf{y}_i = [y([i, 0]), y([i, 1]), \ldots, y([i, N_2 - 1])]^T$,
$\mathbf{x}_2 = (1/N)[0, 1, \ldots, N_2 - 1]^T$,
$\mathbf{1} = [1, 1, \ldots, 1]^T \in \mathbb{R}^{N_2}$,
$N_1 = 2^{m_1}$, $N_2 = 2^{m_2}$, $N = 2^m$, $m = m_1 + m_2$, and
$k_i, S_i, b_i \in \mathbb{R}$ are all fixed-point numbers.

The underlying idea of this formulation is to compute the $m$-bit output part
by part. In the linear function of each interval, we use the term $b_i$ to
represent the most significant $m_1$ bits of the function value, and the term
$k_i \cdot (\mathbf{x}_2 - S_i \cdot \mathbf{1})$ to achieve the
lower-significance $m_2$ bits of accuracy. Then $\mathbf{y}_i$ is simply the
concatenation of the two parts. The procedure to find the fixed-point
representations of the three parameters $k_i$, $S_i$, $b_i$ in (2) is described
in the following steps.

Step 1: Obtain the Floating-Point Version of the PWL Approximation: The optimal
real coefficients of the linear function in each interval in terms of the
$l_2$ norm can be found by the least-squares optimization (3), where the design
variables are $k_i^r$ and $b_i^r \in \mathbb{R}$. The superscripts denote that
they are floating-point real numbers; $\mathbf{x}_2$ and $\mathbf{y}_i$ are
defined as in (2).

$$\min_{k_i^r,\, b_i^r} \left\| \mathbf{y}_i - (k_i^r \cdot \mathbf{x}_2 + b_i^r \cdot \mathbf{1}) \right\|_2, \quad \text{for } i = 0, 1, 2, \ldots, N_1 - 1. \qquad (3)$$

The approximation error bound in (1) shows that the error is proportional to
$(x_{i+1} - x_i)^2$, which in the fixed-point input case equals $2^{-2m_1}$.
Let $m_1 = \lceil m/2 \rceil$; then it is possible to realize the required
$m$-bit output accuracy with only $2^{\lceil m/2 \rceil}$ intervals. Since the
number of intervals determines the number of address bits of the LUT that
stores the parameters of the linear function in each interval, this LUT
($2^{\lceil m/2 \rceil}$ entries) is considerably smaller than a direct map
from input to output ($2^m$ entries). The following steps determine the
fixed-point parameter values, i.e., the content of the LUT.
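
A minimal floating-point sketch of Step 1 follows. It solves (3) independently
for each interval with the closed-form least-squares line and then checks the
resulting error against the $m$-bit target. The choice of $\sin(x)$ as the
function, the 12-bit word length with $m_1 = m_2 = 6$, and all names are
assumptions made for illustration.

```python
import math


def fit_line(ys, xs):
    """Closed-form least-squares line ys ~ k * xs + b; returns (k, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    return k, mean_y - k * mean_x


def float_pwl(f, m1: int, m2: int):
    """Solve (3) for every interval; returns a list of (k_r, b_r) pairs."""
    n2, n = 1 << m2, 1 << (m1 + m2)
    x2 = [j / n for j in range(n2)]          # x_2 = (1/N) [0, 1, ..., N2 - 1]
    params = []
    for i in range(1 << m1):                 # x_1 = i selects the interval
        y_i = [f((i * n2 + j) / n) for j in range(n2)]
        params.append(fit_line(y_i, x2))
    return params


M1 = M2 = 6                                  # assumed split of a 12-bit input
params = float_pwl(math.sin, M1, M2)         # sin(x) is a stand-in function
n2, n = 1 << M2, 1 << (M1 + M2)
max_err = max(
    abs(math.sin((i * n2 + j) / n) - (k * (j / n) + b))
    for i, (k, b) in enumerate(params)
    for j in range(n2)
)
print(f"{len(params)} intervals, max error {max_err:.2e}, "
      f"target {2.0 ** -(M1 + M2):.2e}")
```

With 64 intervals the floating-point fit already sits well below one 12-bit
LSB, which leaves headroom for the quantization introduced in Steps 2 to 4.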

Step 2: Obtain the Fixed-Point Value $b_i$: $b_i$ can be obtained simply by
quantizing $b_i^r$ to $m_1$ bits. As mentioned before, the $m$-bit output is
constructed part by part, with $b_i$ as the constant term in the $i$th
interval, representing the major part of the function value in that interval.
As long as the function value increment in each interval is less than
$2^{-m_1}$, that is, the derivative satisfies $|y'(x)| < 1$, it is enough to
use the $m_1$ MSBs of $b_i$ to represent the $m_1$ MSBs of the output.

Step 3: Obtain the Fixed-Point Value $S_i$: Since Step 2 yields a $b_i$ with a
maximum quantization error of $2^{-m_1}$, an extra parameter $S_i^r$ is
introduced to compensate for the accuracy loss $b_i^r - b_i$, such that
$k_i^r S_i^r = b_i^r - b_i$. Its fixed-point counterpart $S_i$ is derived as in
(4):

$$S_i = \mathrm{quantize}\left(\frac{b_i^r - b_i}{k_i^r}\right). \qquad (4)$$

The number of bits of $S_i$ is determined such that $k_i^r S_i$ has an accuracy
of $m + 1$ bits. From our experience with the functions involved in the SCS
design, $S_i$ usually has around $m/2$ bits or a few (i.e., 2-4) more,
depending on the derivative $k_i$ of the function in each interval.

Step 4: Obtain the Fixed-Point Value $k_i$: The slope of the function in the
$i$th interval, $k_i$, can also be obtained by simply quantizing its
floating-point counterpart from the optimization procedure in Step 1. As shown
in (2), the term $k_i(\mathbf{x}_2 - S_i \cdot \mathbf{1})$ contributes the
second part of the output, the $m_2$ LSBs. Since $x_2 - S_i$ has an accuracy of
at least $m$ bits, $k_i$ has to have at least $m_2$ bits to produce the $m_2$
LSBs of the output.

Fig. 3. (a) Micro-architecture of the PWL approximation. (b) Illustration of
the computations in the PWL approximation.

The above procedure not only provides a way to obtain the three fixed-point
parameters of the linear function in each interval, but also benefits the
high-throughput hardware micro-architecture design. Fig. 3(a) shows the
micro-architecture of the approximation and Fig. 3(b) shows more clearly how
the computations are carried out. There are essentially three operations
involved: one LUT access, one add, and one multiply. The LUT takes the $m_1$
MSBs of the input as the address and outputs the parameters $b_i$, $k_i$,
$S_i$ of the corresponding interval. Then the linear function computations
follow accordingly. From Fig. 3(a), we notice that for all arithmetic
computations, the operands have only $m_1$, $m_2$ or $l_s + m_2$ bits, not $m$
bits, where $l_s$ is the number of bits by which $S_i$ exceeds $m_2$. As
discussed in Step 1, it is a good choice to set $m_1 = \lceil m/2 \rceil$;
hence with operands of roughly $m/2$ bits in all computations, we are able to
achieve the $m$-bit output.
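
The datapath can be modeled in software with integer arithmetic. The sketch
below is a behavioral model under stated assumptions, not the chip's RTL: it
uses a 12-bit input with $m_1 = m_2 = 6$, keeps 8 fractional bits for $k_i$ and
2 extra fractional bits for $S_i$ (the constants `K_FRAC` and `S_FRAC` are
hypothetical), adds the two output parts instead of concatenating them, and
uses $\sin(x)$ as a stand-in function whose slope is never zero. Because the
sign of the offset depends on how (2) and Step 3 are read, the model simply
picks $S_i$ so that $b_i - k_i S_i$ matches the floating-point intercept, which
is what (2) requires; treat that convention as an assumption.

```python
import math

M1, M2 = 6, 6             # assumed MSB/LSB split of a 12-bit input
K_FRAC, S_FRAC = 8, 2     # assumed fractional bits kept for k_i and S_i
N2, N = 1 << M2, 1 << (M1 + M2)


def fit_line(ys, xs):
    """Least-squares line ys ~ k * xs + b; returns (k, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx, my - (sxy / sxx) * mx


def build_lut(f):
    """Return one integer triple (b, k, s) per interval."""
    x2 = [j / N for j in range(N2)]
    lut = []
    for i in range(1 << M1):
        k_r, b_r = fit_line([f((i * N2 + j) / N) for j in range(N2)], x2)
        b = math.floor(b_r * (1 << M1))          # b_i: m1-bit constant term
        resid = b_r - b / (1 << M1)              # part of b_r that b_i drops
        # Assumed sign convention: choose S_i so that b_i - k_i * S_i ~ b_r.
        s = round(-resid / k_r * N * (1 << S_FRAC))
        k = round(k_r * (1 << K_FRAC))           # k_i: quantized slope
        lut.append((b, k, s))
    return lut


def evaluate(lut, x_int: int) -> float:
    """Model of the datapath: LUT read, one subtract, one multiply."""
    x1, x2 = x_int >> M2, x_int & (N2 - 1)       # LUT address and LSB part
    b, k, s = lut[x1]
    diff = (x2 << S_FRAC) - s                    # adder: x2 - S_i
    prod = k * diff                              # multiplier: k_i * (x2 - S_i)
    return b / (1 << M1) + prod / (N << (K_FRAC + S_FRAC))


lut = build_lut(math.sin)                        # sin(x) is a stand-in function
max_err = max(abs(math.sin(x / N) - evaluate(lut, x)) for x in range(N))
print(f"max error {max_err:.2e} vs. one output LSB {1.0 / N:.2e}")
```

Note that every arithmetic operand in `evaluate` is only a few bits wider than
$m/2$, which is the property the text credits for the short critical path.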

This implies two important improvements in hardware efficiency: storage and
throughput. For a function implemented as a direct LUT, if both the input and
output have $m$ bits, the storage required is $m \cdot 2^m$ bits. With the
proposed scheme, the storage is $(2m_2 + l_s + m_1) \cdot 2^{m_1}$ bits, which
is approximately $1.5m \cdot 2^{m/2}$ to $2m \cdot 2^{m/2}$, assuming
$m_1 = m_2 = m/2$ (when $m$ is even) and $l_s$ small ($< 4$). A comparison of
the storage usage between the direct LUT map and the fixed-point PWL
approximation approach is given in Table II for a practical range of $m$ from
10 to 16. The last column of the table shows the ratio of the direct LUT size
to the LUT size of the approximation, which reflects storage savings of
10-100× for the range of values of interest. The net area advantage of our
approach over the direct LUT will depend on the actual technology and
throughput specifications, since these dictate the type of storage elements
being used. For example, in high-throughput applications, register-based LUTs
are needed, while in lower-throughput conditions, SRAM-based LUTs can be used.
Under both types of LUT implementation, the additional area of one adder and
one multiplier is almost negligible compared to the LUT area. For example, in
45 nm SOI technology, the direct LUT implementation of a 16-bit in/out arccos
function consumes an area of 19 mm² in a register-based implementation and
0.7 mm² in an SRAM implementation. With the PWL approximation, the area
reduces to 46200 µm² with a register implementation and 9784 µm² with SRAM.
The adder and multiplier consume roughly 1280 µm² in total, which is only a
small portion of the overall area. The PWL approximation therefore has a large
advantage in storage size, and the advantage becomes more prominent as the
input and output size increases.

As for the throughput, because of the short operands and LUT address, the whole
chain of operations (LUT, add and multiply) can easily be pipelined into a few
stages, depending on the process and throughput requirement. For example, in a
45 nm SOI process, we use two pipeline stages, with the table lookup and the
add in the first stage and the multiply in the second, and this structure can
sustain a throughput of roughly 2 GSamples/s when computing a nonlinear
function with 15-bit input and output.

TABLE II

Storage Comparison Examples Between a Direct LUT Map Approach and Fixed-Point
Piece-Wise Linear Approximation Approach

m  | Direct LUT size L1 (bits) | Approx. LUT size L2 (bits) | Improvement ratio (L1/L2)
10 | 10 × 2^10                 | 20 × 2^5                   | 2^4
12 | 12 × 2^12                 | 24 × 2^6                   | 2^5
14 | 14 × 2^14                 | 28 × 2^7                   | 2^6
16 | 16 × 2^16                 | 32 × 2^8                   | 2^7
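
The storage figures follow directly from the two size formulas. The short
sketch below evaluates them for the word lengths of Table II; the function
names are hypothetical, and the table's approximate size $2m \cdot 2^{m/2}$ is
used for the ratio column.

```python
def direct_lut_bits(m: int) -> int:
    """Direct map: 2^m entries of m bits each."""
    return m * (1 << m)


def pwl_lut_bits(m: int, ls: int = 0) -> int:
    """PWL scheme: 2^m1 entries of (2*m2 + ls + m1) bits, m1 = ceil(m/2)."""
    m1 = (m + 1) // 2
    m2 = m - m1
    return (2 * m2 + ls + m1) * (1 << m1)


for m in (10, 12, 14, 16):
    approx = 2 * m * (1 << (m // 2))   # the 2m * 2^(m/2) estimate of Table II
    ratio = direct_lut_bits(m) // approx
    print(f"m={m}: direct {direct_lut_bits(m)} bits, "
          f"approx {approx} bits, ratio {ratio}")
```

The ratio doubles with every two additional bits of word length, since it
equals $2^{m/2 - 1}$ under the table's estimate.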

As a side note, an alternative way to write our formulation (2) is

$$\mathbf{y}_i = k_i \cdot \mathbf{x}_2 + (-k_i S_i \cdot \mathbf{1} + b_i \cdot \mathbf{1}) = k_i \cdot \mathbf{x}_2 + c_i. \qquad (5)$$

To compare the two formulations, we consider the following two aspects:
storage size and arithmetic computation complexity. In terms of storage size,
formulation (2) requires
$(m_1 + m_2 + m_2 + l_s) \cdot 2^{m_1} = (2m_2 + m_1 + l_s) \cdot 2^{m_1}$ bits,
while (5) requires $(m_1 + m_2 + m_2) \cdot 2^{m_1} = (2m_2 + m_1) \cdot 2^{m_1}$
bits. Formulation (2) does require $l_s \cdot 2^{m_1}$ bits more storage;
however, it brings the advantage of shorter operands for the add operation. In
terms of arithmetic operation complexity, formulation (2) requires an adder
with $(m_2 + l_s)$-bit and $m_2$-bit operands and a multiplier with
$(m_2 + l_s)$-bit and $m_2$-bit operands, while (5) requires an $m$-bit full
adder and an $m_2$-bit multiplier. As $m$ gets large, the long adder in (5) may
need further pipelining, which complicates the design at high throughput.
Furthermore, the optimization lets $b_i$ represent the first $m_1$ bits, while
it chooses $k_i$ and $S_i$ in (2) so that $k_i(x_2 - S_i)$ exactly represents
the remaining $m_2$ bits, to avoid any overflow and an additional adder. Our
design is throughput-limited rather than area-limited; with the above
considerations, we therefore choose formulation (2) to achieve a higher
throughput with more compact arithmetic hardware.

B.  Piece-Wise-Linear  Design  Example 

In this section, we show an example of computing a normalized 16-bit input,
16-bit output arccosine function $y = \arccos(x)/(2\pi)$ using the proposed PWL
approximation approach. This function is one of the functions in the actual AMO
SCS design.

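First, we obtain a floating-point representation of the PWL approximation
through the following least-squares minimization:

$$\min_{\mathbf{x}} \|A\mathbf{x} - \boldsymbol{\beta}\|_2, \quad \text{where}$$

$$A = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ \frac{0}{N^2} & \frac{1}{N^2} & \cdots & \frac{N-1}{N^2} \end{bmatrix}^T_{N \times 2}, \quad
\mathbf{x} = \begin{bmatrix} b_0^r & \cdots & b_{N-1}^r \\ k_0^r & \cdots & k_{N-1}^r \end{bmatrix}_{2 \times N}, \quad
\boldsymbol{\beta} = \begin{bmatrix} y_{0,0} & \cdots & y_{N-1,0} \\ y_{0,1} & \cdots & y_{N-1,1} \\ \vdots & & \vdots \\ y_{0,N-1} & \cdots & y_{N-1,N-1} \end{bmatrix}_{N \times N}. \qquad (6)$$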


Here, $N = 8$, half of the number of input bits; $y_{i,j} = y([i,j]) =
\arccos((2^N i + j)/2^{2N})/(2\pi)$, $i, j = 0, 1, \ldots, N-1$, and $i$ acts
as the address for the LUT. The optimal floating-point parameters
$b'$, $k'$ yield a maximum absolute error $< 2^{-16}$ for inputs
in the range $x \in [0, 0.963]$. For inputs $x \in (0.963, 1]$, the PWL
approximation does not behave as well because the derivative becomes
large as the input approaches 1. However, this case only happens
when the input sample vector nearly aligns with the two
decomposed vectors, namely when $A$ approaches $a_1 + a_2$ and $\alpha_1$,
$\alpha_2 \to 0$. One solution is to redefine the threshold values such
that those samples use a set of higher-level power supplies, so
as to avoid the situation of $\alpha_1, \alpha_2 \to 0$.
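
The floating-point fitting step can be illustrated with a minimal Python sketch. It is not the authors' code: it assumes the 16-bit input is split into an 8-bit segment address and an 8-bit offset, fits one least-squares line per segment in closed form, and reports the worst fitting error over the segments that lie below a chosen input bound. The names `M1`, `M2`, `fit_segment` and `worst_error` are hypothetical.

```python
import math

M1 = 8   # assumed number of address bits: 2**M1 segments
M2 = 8   # assumed number of offset bits: 2**M2 samples per segment


def target(x):
    """Normalized arccosine y = arccos(x) / (2*pi)."""
    return math.acos(x) / (2.0 * math.pi)


def fit_segment(i):
    """Least-squares line y ~ b + k*t on segment i (t is the in-segment offset).

    Returns (b, k, max_abs_error) in floating point, before any quantization.
    """
    n = 1 << M2
    scale = float(1 << (M1 + M2))
    ts = [j / scale for j in range(n)]
    ys = [target((i * n + j) / scale) for j in range(n)]
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    k = (sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
         / sum((t - t_mean) ** 2 for t in ts))
    b = y_mean - k * t_mean
    err = max(abs(b + k * t - y) for t, y in zip(ts, ys))
    return b, k, err


def worst_error(x_max):
    """Largest fitting error over all segments lying entirely at or below x_max."""
    last = int(x_max * (1 << M1)) - 1
    return max(fit_segment(i)[2] for i in range(last + 1))


if __name__ == "__main__":
    print("worst error for x <= 0.963:", worst_error(0.963))
    print("error of the last segment :", fit_segment((1 << M1) - 1)[2])
```

Running the sketch shows the same qualitative behavior described above: the fit is tight over most of the input range and degrades sharply in the segments closest to $x = 1$.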

Then, we quantize the terms $b'$ and $k'$ to 8 bits, and use (4)
to obtain the offset $S$. It turns out that the offset parameter uses
11 bits, and the resulting accuracy after all the quantization is
$< 2^{-15}$ in terms of maximum absolute error.

Table III shows the place-and-route results of the hardware
implementation of the proposed approximation approach, together
with other approaches for comparison. Two versions of the proposed
approach are shown, differing in how the LUT is handled: one version
has a programmable LUT and the other has it hardwired. The approaches
shown for comparison are CORDIC and a 6th-order polynomial
approximation. CORDIC [22] is a general iterative approach to
implementing trigonometric functions. However, because it is
general-purpose, it is much less energy-efficient and has
lower throughput than our PWL approximation. The
polynomial approximation, another alternative for approximating
nonlinear functions, requires many more multipliers
than the PWL approximation and hence is also less energy-efficient.
In summary, the proposed PWL approximation provides
a 6-20× improvement in energy efficiency with significant
area savings over the competing approaches.
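
The energy-per-operation column of Table III follows from dividing the reported power by the clock frequency at which it was simulated. The short Python sketch below reproduces that arithmetic from the table entries; it assumes one result per clock cycle, and the dictionary and function names are hypothetical.

```python
# (power in mW, clock frequency in MHz) as reported in Table III
TABLE_III = {
    "PWL (hardwired LUT)": (3.24, 1000.0),
    "PWL (programmable LUT)": (7.23, 1000.0),
    "Unrolled radix-4 CORDIC": (63.1, 400.0),
    "6th order polynomial": (42.0, 1000.0),
}


def energy_pj_per_op(power_mw, freq_mhz):
    """Energy per operation in pJ, assuming one result per clock cycle."""
    # mW / MHz gives nJ per cycle; multiply by 1000 to express it in pJ.
    return power_mw / freq_mhz * 1000.0


def improvement_over(baseline, reference="PWL (programmable LUT)"):
    """Ratio of the baseline's energy per operation to the reference's."""
    ref = energy_pj_per_op(*TABLE_III[reference])
    return energy_pj_per_op(*TABLE_III[baseline]) / ref


if __name__ == "__main__":
    for name, (p, f) in TABLE_III.items():
        print(f"{name:28s} {energy_pj_per_op(p, f):8.2f} pJ/op")
```

Relative to the programmable-LUT version, the ratios come out at roughly 6× for the polynomial and roughly 22× for CORDIC, close to the range quoted in the text.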


IV.  Chip  Implementation 
A.  Overall  Chip  Design 

The baseband design uses the 64-QAM modulation scheme
and has a target symbol throughput of 1-2 GSym/s. The
system has an oversampling rate of 4 or 2, resulting in a system
sample throughput of 4 GSam/s. The baseband needs to provide
−60 dB adjacent channel power ratio (ACPR). In order to
meet this specification while overcoming the nonlinearity in
the phase modulator DAC [18], the baseband is designed to
achieve −65 dB ACPR with 12-bit phase quantization.

The baseband system has the block diagram shown in Fig. 4.
It consists of two parts: the supporting blocks and the
AMO SCS. The supporting blocks upsample and pulse-shape
the input symbol sequence from the 64-QAM constellation into
appropriate sample sequences, which are then fed to the AMO
SCS blocks. As shown in Fig. 4, the 3-bit I and Q symbols first
pass through a LUT-based nonlinear predistorter with a size
of $2^{10} \times 24$ and produce I/Q symbols with 12-bit accuracy
in each dimension. The system is not designed to have a
powerful nonlinear predistorter, so this simple predistortion
table is added only for preliminary symbol-space predistortion.
The table size is chosen such that the predistorter has some
memory while fitting in the die area. Then the 12-bit I and Q
symbols pass through a pulse-shaping filter, which oversamples
the symbols and produces 12-bit I and Q samples with a shaped
spectrum. Interleaving is exploited here to achieve even higher
throughput: the shaping filter produces one sample at the
positive edge of the clock and another at the negative edge.
Therefore, two copies of the AMO SCS blocks follow the even
and odd outputs of the filter.

TABLE III
Comparison Between PWL, CORDIC Implementations of the 16-bit Input, Output Function $y(x) = \cos^{-1}(x)$

| Implementation | Minimal clock period (ps) | Power consumption (mW) (post-extraction simulation) | Area (µm × µm), Density (%) | Energy per operation (pJ/op) |
|---|---|---|---|---|
| Proposed PWL (hardwired LUT) | 792 | 3.24 (at 1 GHz) | 80 × 60, 80% | 3.24 |
| Proposed PWL (programmable LUT) | 856 | 7.23 (at 1 GHz) | 250 × 240, 77.5% | 7.23 |
| Unrolled radix-4 CORDIC | 2600 | 63.1 (at 400 MHz) | 220 × 200, 81.4% | 157.75 |
| 6th order polynomial | 250 | 42 (at 1 GHz) | 200 × 200, 70% | 42 |

Fig. 4. The block diagram of the chip.

The AMO SCS part, shown zoomed in at the bottom of
Fig. 4, consists of four main sub-blocks: the Cartesian-to-polar
block, the Amplitude-selection block, the Outphasing-angle-computation
block, and the angle function $f(\varphi)$ block. The Cartesian-to-polar
block computes the squared amplitude and the angle of
the I/Q samples in polar coordinates, corresponding to equation
(amo1) in Table I.


The Amplitude-selection block then takes the value of the squared
amplitude and selects the pair of power supplies for the PAs
in the two paths. Recall that the initial motivation for modifying
the LINC architecture into the AMO architecture is to introduce
more supply levels to minimize the combiner loss, especially
when the outphasing angle is large. Therefore, the choice of
the power supplies directly affects the average power efficiency.
According to the Wilkinson combiner's efficiency [9] at sample
amplitude $A$ and the two PAs' supply voltages $a_i$, $a_j$,

\[
\eta_c(A, a_i, a_j) = \left( \frac{A}{\frac{a_i + a_j}{2}} \right)^{2} \left( \frac{2 \left( \frac{a_i + a_j}{2} \right)^{2}}{a_i^2 + a_j^2} \right), \tag{7}
\]

we design the criterion shown in Table IV to select the pair of
power supplies, where

\[
[th_1, th_2, \ldots, th_7] = \left[ (2V_1)^2, (V_1 + V_2)^2, (2V_2)^2, (V_2 + V_3)^2, (2V_3)^2, (V_3 + V_4)^2, (2V_4)^2 \right], \tag{8}
\]

and $V_1 < V_2 < V_3 < V_4$ are the four available power supply
levels. The criterion is designed to maximize the combiner's
efficiency (7) by using the smallest pair of power supplies whose
power levels are still large enough to form the transmitted
sample. Obviously, there are more than the 7 levels used here






Fig.  5.  The  hardware  block  diagram  of  the  SCS  system. 


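TABLE IV
Criterion for Power Supply Pair Selection ($A^2 = I^2 + Q^2$)

| $a_1$, $a_2$ | Criterion |
|---|---|
| $V_1$, $V_1$ | $A^2 < th_1$ |
| $V_1$, $V_2$ | $th_1 < A^2 < th_2$ |
| $V_2$, $V_2$ | $th_2 < A^2 < th_3$ |
| $V_2$, $V_3$ | $th_3 < A^2 < th_4$ |
| $V_3$, $V_3$ | $th_4 < A^2 < th_5$ |
| $V_3$, $V_4$ | $th_5 < A^2 < th_6$ |
| $V_4$, $V_4$ | $th_6 < A^2 < th_7$ |
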

that can be designed from 4 supply levels. An important factor
that motivates the choice of these 7 levels is minimizing
the number of switching events for each
power supply. Power supply switching is accompanied by
ringing and slewing, which introduce nonlinear and memory
effects into the system and cause spectral regrowth and
degradation in the linearity of the overall transmitter.
The rules in (8) change only one power supply, to an adjacent
level, when the sample amplitude jumps from one region to an
adjacent region. This is what happens most of the time, because
the pulse-shaping filter smooths the I/Q symbol transitions and
limits the jumps between I/Q samples.
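
The selection rule of (8) and Table IV amounts to a threshold search on the squared amplitude. The Python sketch below is an illustration only: the supply values in the usage example are made up, the handling of samples that fall exactly on a threshold is an assumption (they are assigned to the lower region), and `thresholds`, `PAIRS` and `select_pair` are hypothetical names.

```python
from bisect import bisect_left


def thresholds(levels):
    """Squared-amplitude thresholds th_1..th_7 of (8) for V1 < V2 < V3 < V4."""
    v1, v2, v3, v4 = levels
    sums = [2 * v1, v1 + v2, 2 * v2, v2 + v3, 2 * v3, v3 + v4, 2 * v4]
    return [s * s for s in sums]


# Supply-level indices (0-based) for the seven regions of Table IV.
PAIRS = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3), (3, 3)]


def select_pair(i, q, levels):
    """Pick the smallest supply pair able to form the sample (i, q)."""
    a2 = i * i + q * q                      # squared amplitude A^2 = I^2 + Q^2
    region = bisect_left(thresholds(levels), a2)
    if region >= len(PAIRS):
        raise ValueError("sample amplitude exceeds 2*V4")
    lo, hi = PAIRS[region]
    return levels[lo], levels[hi]


if __name__ == "__main__":
    levels = (1.0, 2.0, 3.0, 4.0)           # illustrative supply levels only
    print(select_pair(3.0, 1.0, levels))    # A^2 = 10 falls between th_2 and th_3
```

Consecutive entries of `PAIRS` differ in exactly one supply index, which is the property the text relies on to limit switching events when the amplitude moves between adjacent regions.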

The Outphasing-angle-computation block computes the two
angles between the decomposed and transmitted vectors,
corresponding to equations (amo2) and (amo3) in Table I. The
computation is divided into the four sub-blocks shown in Fig. 4.
Sub-blocks I and II compute the argument of the arccosine function,
$(A^2 + a_i^2 - a_j^2)/(2Aa_i)$, which involves square-root, inverse
square-root and summation operations. The terms $1/(2a_i)$ and
$(a_i^2 - a_j^2)/(2a_i)$ in sub-block II are two programmable constants
that are selected once the two supply levels are determined. Then
sub-block III computes the arccosine function and sub-block IV computes
the final outphasing angles.

The last block, the $f(\varphi)$ computation, prepares the input signals
for the phase modulator we use, which take the form $1/(1 +
\tan(\varphi))$. The LUT used in this block can also be programmed to
compensate for the static nonlinearity of the phase modulator DAC.

As  a  summary,  Table  V  lists  the  arithmetic  operations  for  each 
functional  block. 

B.  SCS  Blocks  Design 

In this section, we show details of the micro-architecture of
each block in the SCS system. Fig. 5 shows the overall pipelined
hardware block diagram. It is roughly a direct translation of
the conceptual block diagram in Fig. 4. The I/Q samples generated
by the shaping filter first pass through the getTheta block,
which produces $\theta$ and $|I|$, $|Q|$. The following getAlpha block
then takes $|I|$ and $|Q|$, selects the two power supplies and
computes the angles $\alpha_1$ and $\alpha_2$. This roughly corresponds to the
Amplitude-selection and Outphasing-angle-computation blocks in
Fig. 4. The angles $\alpha_1$ and $\alpha_2$, together with $\theta$, are inputs to the
getPhi block, which computes the function $1/(1 + \tan(\varphi))$ on
the outphasing angles $\varphi_1$, $\varphi_2$. This represents the $f(\varphi)$ block
in Fig. 4. The final outputs of the SCS system are $f\varphi_1$, $f\varphi_2$,
$quad_1$, $quad_2$, and $a_1$, $a_2$. Here, $quad_1$ and $quad_2$ are quadrant
indicators of $\varphi_1$ and $\varphi_2$, respectively; $f\varphi_1$, $f\varphi_2$ are computed
with $\varphi_1$, $\varphi_2$ converted to the first quadrant; $a_1$ and $a_2$ are the
digital codes that control the PA power supply switches. Next,
we see how each sub-block accomplishes its tasks.

TABLE V
Summary of Arithmetic Operations in Each Functional Block of the AMO SCS

| Functional block | Arithmetic operations |
|---|---|
| Cartesian-to-polar | multiply, division, arctan |
| Amplitude selection | comparator |
| Outphasing angles, SUB_BLK I | square-root, inversion of square-root |
| Outphasing angles, SUB_BLK II | multiply, add |
| Outphasing angles, SUB_BLK III | arccos |
| Outphasing angles, SUB_BLK IV | add |
| $f(\varphi)$ block | $1/(1 + \tan(\varphi))$ |

Fig. 6. (a) The hardware block diagram of the getTheta block. (b) The hardware block diagram of the getPhi block.

1) getTheta Block: Fig. 6(a) shows the micro-architecture of
the getTheta block, which has two main operations: division
and arctan. With the PWL approximation algorithm discussed in
Section III-A, both functions can be realized with the
micro-architecture in Fig. 3. Before applying the approximation, it is
important to carefully examine the input and output ranges of
the function, because of the nature of fixed-point computation.
To obtain good accuracy from the approximation,
it is desirable to have an input range where the function behaves
smoothly and has a nicely bounded derivative. Consider the
division function as an example. The division function $Q/I$ has
two input variables, while the presented algorithm assumes a
single-variable function. So the computation of $Q/I$ is divided
into $1/I$, followed by $Q \times (1/I)$. The inversion function $1/I$ has
a discontinuity at $I = 0$ and its derivative $-1/I^2$ becomes large
as $|I|$ approaches zero. In order to use the PWL approximation
with good accuracy, several preprocessing steps are necessary
to condition the input before approximating the
inversion function $1/I$. We implement the following treatments
on the input, corresponding to the divPrep block in Fig. 6(a):

• Step (1): $(I, Q)$ are first transformed to the first quadrant
as $(I', Q')$, where $I' = |I|$ and $Q' = |Q|$. A two-bit flag
indicates whether each component of the current sample $(I, Q)$ is
actually negative.

• Step (2): Swap $I'$ and $Q'$ if $Q' > I'$, so the resulting
$(I'', Q'')$ satisfies $Q''/I'' \in (0, 1)$. The boundary values
of 0 and 1 are computed separately as special cases. Again,
a flag indicates whether the swap is performed on the
current sample.

• Step (3): Shift the input $I''$ such that $I'' \in (1, 2)$. The
shift operation is always valid because the shaping filter
coefficients are programmable and can be designed such
that $I, Q \in [0, 1]$. This step simply means shifting the bits of
$I''$ to the left until the MSB is 1. Record the number of bits
shifted for each sample $I''$.

Although after these transformations $Q''/I''$ obviously
differs from the desired output $Q/I$, the preprocessing
steps can be compensated. Specifically, the swap in Step (2)
and the absolute-value operation in Step (1) are taken care of after the
computation of $\theta$, and the shift operation in Step (3) is taken
care of after the computation of $Q'' \times (1/I'')$.

• Step (1): Shift back accordingly after the computation of
$Q'' \times (1/I'')$. This operation is included in the divPost block,
together with the multiplication $Q'' \times (1/I'')$.

• Step (2): After the computation of $\theta'$, for values whose flag
indicates that a swap has happened, $\theta = \pi/2 - \theta'$;
otherwise $\theta = \theta'$. This is included in the atanPost block in
Fig. 6(a).

• Step (3): After Step (2), we need to check further whether a
quadrant change has happened to the current sample, and adjust
$\theta$ accordingly. This is also part of the atanPost block.
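
The pre- and post-processing steps above can be sketched in a few lines of Python. This is a floating-point illustration, not the fixed-point hardware: `1.0 / x` and `math.atan` stand in for the divApprox and atanApprox PWL blocks, the function name `get_theta` is hypothetical, and the loop that shifts right is an addition for inputs larger than the range assumed in the text.

```python
import math


def get_theta(i, q):
    """Angle of the sample (i, q) in radians, following the divPrep/atanPost steps."""
    # Pre-step (1): fold into the first quadrant and remember the signs.
    neg_i, neg_q = i < 0, q < 0
    ip, qp = abs(i), abs(q)
    if ip == 0 and qp == 0:
        return 0.0                      # boundary case handled separately
    # Pre-step (2): swap so that the ratio lies in [0, 1].
    swapped = qp > ip
    if swapped:
        ip, qp = qp, ip
    # Pre-step (3): shift the denominator into [1, 2) and record the shift.
    shift = 0
    while ip < 1.0:
        ip *= 2.0
        shift += 1
    while ip >= 2.0:                    # not needed when inputs are limited to [0, 1]
        ip /= 2.0
        shift -= 1
    inv = 1.0 / ip                      # stand-in for the PWL 1/x block on [1, 2)
    # Post-step (1): multiply, then undo the shift.
    ratio = qp * inv * (2.0 ** shift)
    theta = math.atan(ratio)            # stand-in for the PWL arctan block on [0, 1]
    # Post-step (2): undo the swap.
    if swapped:
        theta = math.pi / 2.0 - theta
    # Post-step (3): restore the original quadrant.
    if neg_i:
        theta = math.pi - theta
    if neg_q:
        theta = -theta
    return theta


if __name__ == "__main__":
    print(get_theta(-0.3, 0.4), math.atan2(0.4, -0.3))
```

Comparing the result with `math.atan2` for samples in all four quadrants is a quick way to confirm that the flags recorded during preprocessing are sufficient to recover the original angle.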

With properly designed preprocessing, the input of the inversion
function $1/x$ lies in the range $(1, 2)$, and the input of the function
$\arctan(x)$ lies in the range $(0, 1)$. In these ranges, the
functions have nicely bounded derivatives, making them
suitable for the fixed-point PWL approximation. The approximation
computations for the two functions are represented by the blocks
divApprox and atanApprox in Fig. 6(a), whose micro-architecture
follows the one in Fig. 3(a). The overall getTheta block
achieves a throughput of 2 GSamples/s in the place-and-route
timing analysis. The look-up tables that store $b$, $S$, and $k$
for the two functions have the sizes summarized in the first two
lines of Table VI. The table also gives a size comparison with
LUTs that map the nonlinear functions directly.
There, we can see that our fixed-point PWL approximation
approach saves orders of magnitude in LUT size. The
accuracy column also shows that an output accuracy of 14 bits is
achieved.

2) getAlpha Block: Fig. 7 shows the detailed micro-architecture
of the getAlpha block of Fig. 5, which also corresponds
to the conceptual sub-blocks I, II and III of the
Outphasing-angle-computation part in Fig. 4. The $\alpha_1$ and $\alpha_2$ computations
consist of two parts: obtaining the argument of the arccos function
and calculating the arccos function itself. In order to obtain the
argument $(a_i^2 + A^2 - a_j^2)/(2Aa_i)$, we rearrange the terms as

\[
\frac{a_i^2 + A^2 - a_j^2}{2Aa_i} = c_1 A + c_2 \frac{1}{A}, \quad \text{and} \quad c_1 = \frac{1}{2a_i}, \quad c_2 = \frac{a_i^2 - a_j^2}{2a_i}, \tag{9}
\]

where the constants $c_1$ and $c_2$ are programmable values that are
selected according to the choice of power supplies. The problem
with using the original formula $(a_i^2 + A^2 - a_j^2)/(2Aa_i)$ is the
long-bit division, whose inputs are on the order of $A^2$. On
the other hand, (9) involves no computations with inputs on the
order of $A^2$.

TABLE VI
Summary of Accuracy and LUT Size of the PWL Approximated Function Blocks

| Function | Max. absolute error | PWL LUT size | Direct LUT size | Improvement ratio |
|---|---|---|---|---|
| $1/x$ | 7e-5 | $30 \times 2^{7}$ | $15 \times 2^{12}$ | 4 |
| $\arctan(x)$ | 6e-5 | $25 \times 2^{7}$ | $15 \times 2^{15}$ | 128 |
| $\sqrt{x}$ | 2.3e-5 | $30 \times 2^{7}$ | $12 \times 2^{19}$ | 1638 |
| $1/\sqrt{x}$ | 8.2e-5 | $30 \times 2^{7}$ | $12 \times 2^{19}$ | 1638 |
| $\arccos(x)$ | 2.4e-5 | $30 \times 2^{7}$ | $15 \times 2^{15}$ | 128 |
| $1/(1 + \tan(x))$ | 1.6e-5 | $26 \times 2^{7}$ | $10 \times 2^{15}$ | 100 |
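
The rearrangement in (9) can be checked numerically. The Python sketch below evaluates the arccosine argument both ways and confirms that the two resulting angles reproduce the amplitude through the law of cosines. The function names are hypothetical, and the sample values in the test are illustrative only.

```python
import math


def arccos_arg_direct(amp, a_i, a_j):
    """Original form: one long division with operands on the order of A^2."""
    return (a_i ** 2 + amp ** 2 - a_j ** 2) / (2.0 * amp * a_i)


def supply_constants(a_i, a_j):
    """Constants c1, c2 of (9); they depend only on the selected supply pair."""
    c1 = 1.0 / (2.0 * a_i)
    c2 = (a_i ** 2 - a_j ** 2) / (2.0 * a_i)
    return c1, c2


def arccos_arg_rearranged(amp, inv_amp, c1, c2):
    """Rearranged form: two multiplications and one addition on A and 1/A."""
    return c1 * amp + c2 * inv_amp


def outphasing_alpha(amp, a_i, a_j):
    """Angle between the transmitted vector and the component of amplitude a_i."""
    c1, c2 = supply_constants(a_i, a_j)
    arg = arccos_arg_rearranged(amp, 1.0 / amp, c1, c2)
    return math.acos(max(-1.0, min(1.0, arg)))   # clamp against rounding


if __name__ == "__main__":
    amp, a1, a2 = 2.5, 1.0, 2.0
    print(arccos_arg_direct(amp, a1, a2),
          arccos_arg_rearranged(amp, 1.0 / amp, *supply_constants(a1, a2)))
```

Because `c1` and `c2` depend only on the supply pair, they can be looked up once the pair is selected, leaving only operations on $A$ and $1/A$ in the datapath, which is the point made in the text.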

The computations to obtain the terms $A$ and $1/A$ in (9) involve
approximations of the functions $\sqrt{x}$ and $1/\sqrt{x}$, whose input
is the sum of $|I|^2$ and $|Q|^2$. As discussed for the
division computation, certain input preprocessing is necessary
to avoid the large derivatives near the discontinuity point at 0. The
SqrtPrep block of Fig. 7 serves this purpose by scaling the input
to the range $[1/4, 1)$, namely by shifting two bits at a time either
to the left or to the right until the input fits in the range. Then the
approximations of the two functions are performed, followed
by the postprocessing parts that compensate for the shifting
operations applied to the inputs. With two more multipliers and one
adder, the computations of (9) are accomplished. Then the
function $\arccos(x)$ takes the input arguments and produces the angles
$\alpha_1$, $\alpha_2$, as already shown in the previous example. The LUT sizes
and accuracy for the three functions are summarized in Table VI.
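
The range reduction performed by SqrtPrep can be sketched as follows. The code is a floating-point illustration under the assumption that shifting by two bits corresponds to scaling by 4, so that the square root scales by 2 per shift. `math.sqrt` stands in for the PWL blocks, and `sqrt_and_rsqrt` is a hypothetical name.

```python
import math


def sqrt_and_rsqrt(x):
    """Return (sqrt(x), 1/sqrt(x)) using range reduction to [1/4, 1)."""
    if x <= 0.0:
        raise ValueError("input must be positive")
    pairs = 0                       # number of two-bit shifts applied
    while x >= 1.0:                 # shift right two bits at a time
        x /= 4.0
        pairs += 1
    while x < 0.25:                 # shift left two bits at a time
        x *= 4.0
        pairs -= 1
    # Now x is in [1/4, 1); the original input equals x * 4**pairs.
    root = math.sqrt(x)             # stand-in for the PWL sqrt block
    inv_root = 1.0 / root           # stand-in for the PWL 1/sqrt block
    # Postprocessing: each two-bit shift scales the root by one bit.
    return root * 2.0 ** pairs, inv_root * 2.0 ** (-pairs)


if __name__ == "__main__":
    print(sqrt_and_rsqrt(0.01))
```

Shifting in pairs of bits keeps the compensation a pure shift: each factor of 4 on the input becomes a factor of 2 on the square root and a factor of 1/2 on its inverse.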

3) getPhi Block: Shown in Fig. 6(b) and as the final block
in Fig. 5, getPhi takes the outputs $\alpha_1$, $\alpha_2$ and $\theta$ from the
preceding getAlpha and getTheta blocks and produces the final
outphasing angles $f\varphi_1$ and $f\varphi_2$. The getPhi block first computes
the outphasing angles $\varphi_1$, $\varphi_2$ in the sub-block ftanPrep; then the
$1/(1 + \tan(\varphi))$ block computes the final outputs. Nominally,
the digital baseband SCS's tasks end after ftanPrep,
delivering the outphasing angles themselves. However, there may be
additional signal processing tasks at the interface between the
digital baseband and the DRFPC phase modulator. In our case,
the phase modulator we intend to use requires such a function
of the outphasing angle as its input.

After obtaining the outphasing angles as $\varphi_1 = \theta - \alpha_1$ and
$\varphi_2 = \theta + \alpha_2$, we convert them to the first quadrant and use 2-bit
flags $quad_1$ and $quad_2$ to indicate the quadrants. This conversion
is necessary both to meet the phase modulator input
requirement and as a preprocessing step for the
following functional approximation. With the input limited to the
first quadrant, the function $1/(1 + \tan(\varphi))$ has a nicely bounded
derivative, $-1/(1 + \sin(2\varphi))$, in the range $[0, \pi/2]$. Otherwise,
the function has a discontinuity at $3\pi/4$. So it is suitable
to apply the PWL approximation to this function as well. The
hardware cost in terms of the LUT size is again summarized in
Table VI.
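
A small Python sketch of the getPhi steps is given below. It is a floating-point illustration with hypothetical function names. The quadrant encoding, an integer 0 to 3 counting multiples of $\pi/2$, is an assumption about how the 2-bit flags might be defined, not a detail taken from the design.

```python
import cmath
import math


def outphasing_angles(theta, alpha1, alpha2):
    """phi_1 = theta - alpha_1 and phi_2 = theta + alpha_2."""
    return theta - alpha1, theta + alpha2


def fold_to_first_quadrant(phi):
    """Return (quad, folded) with phi = quad*pi/2 + folded modulo 2*pi."""
    half_pi = math.pi / 2.0
    phi = phi % (2.0 * math.pi)
    q = int(phi // half_pi)
    return q % 4, phi - q * half_pi


def f_phi(folded):
    """Phase modulator input function 1 / (1 + tan(phi)) on the folded angle."""
    return 1.0 / (1.0 + math.tan(folded))


if __name__ == "__main__":
    theta, amp, a1, a2 = 0.7, 2.5, 1.0, 2.0
    alpha1 = math.acos((amp**2 + a1**2 - a2**2) / (2.0 * amp * a1))
    alpha2 = math.acos((amp**2 + a2**2 - a1**2) / (2.0 * amp * a2))
    phi1, phi2 = outphasing_angles(theta, alpha1, alpha2)
    total = a1 * cmath.exp(1j * phi1) + a2 * cmath.exp(1j * phi2)
    print(abs(total), cmath.phase(total))     # should recover amp and theta
    print(fold_to_first_quadrant(phi2), f_phi(fold_to_first_quadrant(phi2)[1]))
```

The recombination check confirms that the two constant-amplitude components sum to the original sample, and a finite-difference check of `f_phi` against $-1/(1 + \sin(2\varphi))$ confirms the bounded derivative quoted in the text.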

C.  Experimental  Results 

With all nonlinear functions properly approximated and
parameters quantized, the tested SCS output produces the signal
spectrum shown in Fig. 8(a). Compared with the spectrum
at the shaping filter's output, the SCS block degrades the ACPR
by 2 dB, from 67 dB to 65 dB, due to the approximation and
quantization errors. Fig. 8(b) shows the 64-QAM constellation
diagram comparing the SCS output with the ideal input, illustrating that
the SCS introduces an EVM of 0.08%.

The digital AMO SCS system is fabricated in a 45 nm SOI
process, with 448578 gates occupying an area of 1.56 mm².
The chip runs at up to 1.7 GHz (3.4 GSamples/s) from a 1.1 V supply. As
shown in the shmoo plot of Fig. 9, lowering the power supply
voltage decreases the dynamic power of the SCS digital system
until it hits the minimum-energy point at lower throughput,
where leakage energy takes over. The minimum-energy point
of 58 pJ per sample, or 19 pJ per bit in 64-QAM transmission
(assuming 2× oversampling), is measured at 800 MSamples/s
throughput. For a typical PA efficiency of 40% and a throughput
of 800 MSamples/s, at a peak output power level of 1.8 W, the
total peak PAE is affected by only about 1% (46 mW/(46 mW +
1.8 W/0.4)) by this 64-QAM capable AMO SCS backend.
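
The figures quoted in this paragraph can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the numbers stated in the text; the function names are hypothetical.

```python
def scs_power_mw(energy_pj_per_sample, throughput_msps):
    """SCS power in mW from energy per sample (pJ) and throughput (MSamples/s)."""
    # pJ * MSamples/s = microwatts; divide by 1000 to get mW.
    return energy_pj_per_sample * throughput_msps / 1000.0


def energy_per_bit_pj(energy_pj_per_sample, oversampling, bits_per_symbol):
    """Energy per transmitted bit: samples per symbol over bits per symbol."""
    return energy_pj_per_sample * oversampling / bits_per_symbol


def scs_share_of_dc_power(scs_mw, pa_output_w, pa_efficiency):
    """Fraction of total DC power drawn by the SCS next to the PA."""
    pa_dc_mw = pa_output_w / pa_efficiency * 1000.0
    return scs_mw / (scs_mw + pa_dc_mw)


if __name__ == "__main__":
    p = scs_power_mw(58.0, 800.0)
    print("SCS power (mW):", p)
    print("energy per bit (pJ):", energy_per_bit_pj(58.0, 2, 6))
    print("share of DC power:", scs_share_of_dc_power(p, 1.8, 0.40))
```

With 6 bits per 64-QAM symbol and 2 samples per symbol, 58 pJ per sample corresponds to a little over 19 pJ per bit, and the SCS accounts for roughly 1% of the total DC power at the stated PA operating point.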
Fig. 10. Chip photograph.

Fig. 8. Spectrum and EVM of the SCS. (a) Spectrum comparison of the SCS
output and shaping filter output. (b) EVM comparison of the SCS output and
ideal input.

Fig. 9. Throughput and energy with supply scaling for AMO SCS.


The chip photograph is shown in Fig. 10, with annotated
blocks and sizes. The power breakdown of the AMO SCS is
illustrated in Fig. 11(a), which shows the estimated contributions
to the total AMO SCS power at 2 GHz operation, based on the
reported post-place-and-route power estimates. The
large proportion of clocking power is due in part to the
latency-matching register stages on the amplitude paths, required to
compensate for the depth of the phase computations, and the
leakage power of the getPhi block is due to its programmable
LUT for the $f(\varphi)$ function. The area breakdown of the AMO
SCS is illustrated in Fig. 11(b), which shows the areas of the major
functional blocks of the three main functions of the SCS. The
computation of the function $f(\varphi)$ takes over two thirds of the
area due to its programmable LUTs. A comparison of our work
with other digital/analog implementations of LINC/AMO SCS
is summarized in the first 5 columns of Table VII. Our work
demonstrates a design with the highest throughput and phase
accuracy to date. For a fairer comparison with other
digital AMO SCS work, we scaled the design in [15] to provide
the same phase accuracy, technology node and throughput. The
scaled performances are summarized in the last 3 columns of
Table VII, and our design shows more than 2× improvement
in energy efficiency and 25× improvement in area. As a general
guideline, for applications with low/medium accuracy requirements
(e.g. less than 8-bit phase resolution) and low/medium
throughput (e.g. up to hundreds of MSamples/s), a LUT is still
a good design choice because of its low energy consumption,
reasonable size and low design complexity. On the other hand, our
proposed approach is more suitable for applications with high
accuracy (e.g. greater than 10-bit phase resolution) and high
throughput (e.g. around GSamples/s) requirements.

Fig. 11. (a) Power breakdown of the AMO SCS design. (b) Area breakdown
of the AMO SCS design.

TABLE VII
Comparison With Other Works

| | This work | [13] | [23] | [19] | [15] | [15] | [15] | [15] |
|---|---|---|---|---|---|---|---|---|
| Analog/Digital | Digital | Analog | Analog | Digital | Digital | Digital | Digital | Digital |
| Functionality | AMO | LINC | LINC | LINC | AMO | AMO | AMO | AMO |
| Technology | 45nm SOI CMOS | 0.25µm CMOS | 0.35µm CMOS | 90nm CMOS | 90nm CMOS | 90nm CMOS | Scaled to 45nm CMOS | Scaled to 45nm CMOS |
| Throughput | 3.4GSam/s, 0.8GSam/s | 20MSam/s | 1.5MSam/s | 50MSam/s | 40MSam/s | 40MSam/s | 40MSam/s | Scaled to 0.8GSam/s |
| Phase Resolution | 12-bit | N/A | N/A | 8-bit | 8-bit | Scaled to 12-bit | Scaled to 12-bit | Scaled to 12-bit |
| Power | 323mW, 46mW | 45mW | 80mW | 0.95mW | 0.36mW | 8.64mW | 4.32mW | 86.4mW |
| Energy/Sample | 95pJ/Sam, 58pJ/Sam | 2250pJ/Sam | 5333pJ/Sam | 19pJ/Sam | 8.9pJ/Sam | 212pJ/Sam | 106pJ/Sam | 106pJ/Sam |
| Area | 1.5mm² | 0.1mm² | 0.61mm² | 0.06mm² | 0.34mm² | 8.16mm² | 2.04mm² | 40.8mm² |

V.  Conclusion 

In this paper, we present a chip design of a high-throughput
(3.4 GSamples/s) SCS for the AMO PA architecture. In order
to achieve energy- and area-efficient high-throughput operation,
we developed a new fixed-point piece-wise linear approximation
algorithm for computing the nonlinear functions
in the SCS design. This new algorithm and the corresponding
implementation achieve over 2× improvement in energy efficiency
and 25× improvement in area efficiency over
traditional AMO SCS implementations. The algorithm has the attractive
properties of few and simple arithmetic operations, short arithmetic
operands and small look-up tables, and can be easily
pipelined to run at multi-GSamples/s throughputs. Designed in
45 nm SOI technology, this SCS implementation is the fastest
SCS implementation demonstrated to date. Though we demonstrate
the application of the approximation algorithm with the
AMO SCS, the approximations are directly applicable to LINC
SCS, and enable a new class of wideband wireless mm-wave
communication system designs with high energy and spectral
efficiency.

B. 94-GHz Transmitter Design

In this section we include technical detail on the 94-GHz design done in the final installment of the project,
since this is unpublished work.


NCSU worked on the following tasks during this contract: (a) design of a 94-GHz outphasing
transmitter in 45nm SOI CMOS, (b) assistance in the design of a 45-GHz outphasing transmitter in 45nm
SOI CMOS, (c) assistance in the design of printed circuit boards and package transitions, (d) assistance
in the measurement of the above circuits, and (e) exploration of alternative transmitter architectures. A
tape-out of a single-chip outphasing transmitter including both 45-GHz and 94-GHz sections was
completed in June 2013. Hardware was received and measurements of functionality were completed. The
work was completed by the principal investigator, Dr. Brian Floyd, and a PhD student, Yi-Shin Yeh.


B.1. Design of 94-GHz Outphasing Transmitter

A 94-GHz phase modulator circuit was designed to operate with the MIT digital baseband and the
CMU/NCSU 94-GHz power amplifier. The 94-GHz analog transmitter (TX) was broken up into four
functional blocks, as follows: the I/Q DACs
designed at CMU and reused from the 45-GHz TX design; the 31-GHz phase modulators designed jointly
at CMU and NCSU; the 30-to-94-GHz upconversion block designed at NCSU; and the 94-GHz PAs
designed at NCSU. A simplified block diagram of the system is shown in Fig. B1, whereas Fig. B2 shows
the top-level layout, which has an area of 3 × 2 mm². A two-step up-conversion architecture was chosen to
allow the reuse of modulator circuitry and thereby minimize risk for the program. The transmitter consists
of 31-GHz phase modulators, 31-to-94-GHz up-conversion mixers, 94-GHz outphasing power amplifiers
with output power combiners, and a 31/62-GHz local-oscillator (LO) path.
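
The two-step frequency plan can be written out as simple arithmetic. The Python sketch below assumes that the phase modulators run at the external LO frequency, that the doubler produces exactly twice that frequency, and that the up-conversion mixer selects the sum frequency. The function names are hypothetical, and the numbers are the nominal ones from the text rather than measured LO settings.

```python
def upconversion_plan(f_lo_ext_ghz):
    """Nominal frequencies (GHz) for the modulator, doubled LO and RF output."""
    f_if = f_lo_ext_ghz             # phase modulators operate at the external LO
    f_lo2 = 2.0 * f_lo_ext_ghz      # output of the frequency doubler
    f_rf = f_if + f_lo2             # sum frequency out of the up-conversion mixer
    return {"if": f_if, "lo2": f_lo2, "rf": f_rf}


def external_lo_for(f_rf_ghz):
    """External LO (GHz) that places the sum frequency at the requested RF."""
    return f_rf_ghz / 3.0


if __name__ == "__main__":
    print(upconversion_plan(31.0))
    print(external_lo_for(94.0))
```

Under these assumptions the RF output sits at three times the external LO, so the nominal 31-GHz input gives 93 GHz and an output at exactly 94 GHz would call for an external LO near 31.33 GHz.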




Fig. B1: Simplified schematic of 94GHz TX.


Fig.  B2:  Layout  screenshot  of  94GHz  TX. 


B.1.1.  31  GHz  Phase  Modulator 

Fig. B3 shows the schematic of the 31-GHz single-sideband modulator along with a performance
summary. The circuit modulates baseband I/Q currents onto 31-GHz quadrature LO signals. Two I and Q
single-balanced mixers are used with quadrature-selected switches. Quadrature currents are summed
directly at the output. The tail currents of each I/Q single-balanced mixer come from the biasing current
mirror driven by the I/Q baseband DACs. According to simulations, the modulator has 2.7-dB conversion
gain at −10-dBm 31-GHz LO input with about 10-GHz 3-dB LO bandwidth. The 3rd harmonic rejection
is −25 dBc and the circuit consumes 27 mA from a 1.5-V supply.


Fig. B3: 31-GHz single-sideband modulator summary.


B.1.2.  31-to-94  GHz  Mixer 

Fig. B4 shows a schematic of the 31-to-94 GHz up-conversion mixer. It mixes a 31-GHz modulated
IF signal with a 62-GHz LO signal. The pseudo-differential common-source transconductance stage is
optimized for conversion gain and 31-GHz input matching. The LO switching transistors are optimized to
have a small overdrive voltage for the relatively low LO input swing and to have good impedance
matching at the LO input. The up-conversion mixer achieves more than 10-GHz 3-dB IF input bandwidth
(lower left, Fig. B4) and 5-dBm saturated output power with −20-dBc harmonic rejection (lower right,
Fig. B4). After tape-out, it was discovered that the bias voltage for the upper transistors in the mixer did not
include an on-chip bypass capacitor, leading to some ringing within the mixer. Through collaboration
with MIT, it was confirmed that the ringing issue within the 30-to-94-GHz upconversion mixer limits the
overall linearity of the transmitter and causes high error-vector magnitude. This design bug would be easy
to fix in a subsequent tape-out, provided funds and time were available for that tape-out.

Fig. B4: 31-to-94-GHz up-conversion mixer.


B.1.3.  31-to-62  GHz  Frequency  Doubler  with  Active  Output  Power  Splitter: 


Fig. B5 shows the schematic of the 31-to-62 GHz self-mixing frequency doubler with an active output
power splitter. The doubler multiplies the external 31-GHz LO input signal up to a 62-GHz LO output
signal. The cascode active output power splitter divides the 62-GHz LO into two paths for outphasing.
The inter-stage matching of the doubler is designed to minimize the DC offset caused by self-mixing.
The doubler achieves -4.5 dB conversion gain and a 6-GHz 3-dB LO bandwidth (lower left, Fig. B5).
A saturated output power of 0 dBm is obtained at 62 GHz with -37.1 dBc 4th-harmonic rejection (lower
right, Fig. B5).
B.1.4.  62  GHz  Differential  LO  Amplifier

Fig. B6 shows the schematic of the 62-GHz differential LO amplifier. The LO amplifier is designed
to provide enough LO power to drive the switching transistors of the mixer. A 20-GHz 3-dB LO
bandwidth is achieved (lower left, Fig. B6), and the added degeneration inductance gives a linear
output LO power characteristic (lower right, Fig. B6).
B.1.5.  94GHz  Predriver  Amplifier:


Fig. B7 shows the schematic of the 94-GHz single-ended RF amplifier. The three-stage RF amplifier
is designed to deliver more than 10 dBm of output power to drive the outphasing PAs into saturation.
Its devices are scaled down from the PA designs. A 20-GHz 3-dB RF bandwidth is obtained.
The final stage of the three-stage RF amplifier is itself fully saturated when driving the outphasing PAs.
Fig. B7: Pre-driver 94-GHz Amplifier


B.1.6.  Top-Level  Power  Management 

Fig.  B8  shows  the  power  management  for  the  interface  of  the  building  blocks.  To  achieve  10-dBm 
output  power  to  drive  the  outphasing  PAs  in  saturation,  more  than  0-dBm  31-GHz  external  LO  input  is 
required.  When  driven  with  this  power,  the  RF  powers  through  the  chain  are  marked  in  Fig.  B8  in  red. 
Fig. B8: Power management of the TX.

B.1.7.  Board  Design  for  System

The NCSU team assisted in the design of the board transmission lines and transitions for the 31-GHz
local oscillator signal and the 94-GHz output PA signal. Designs were completed in HFSS and included
the complete board-level transmission line, the solder-ball transition to the BGA package, and the
on-chip transition that terminates the path. Our team also assisted the MIT team in evaluating the
integrity of the solder-ball transitions and the ground connections. Figure B9 shows a summary of the
designed structures from HFSS and Figure B10 shows the simulated results. These results are for a
two-port network representing the entire board-to-chip transition, including the transmission line and
the GSG transition. Good matching is achieved on both ports at 94 GHz and the overall insertion loss is
less than 2 dB. The same approach was also completed for the 31-GHz LO signal. Here, the 31-GHz
transmission line on the board was much longer; hence, the insertion loss increased to 2.6 dB. Good
matching was also obtained for the LO ports at 31 GHz.
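To put the simulated insertion-loss figures in perspective, the short sketch below converts a loss in dB into the fraction of power that reaches the far port. It is illustrative arithmetic only; the function name is hypothetical and the 2 dB entry is the upper bound quoted for the 94-GHz path.

```python
def power_fraction_delivered(insertion_loss_db: float) -> float:
    """Fraction of input power that survives a given insertion loss (in dB)."""
    return 10.0 ** (-insertion_loss_db / 10.0)


# Simulated values quoted in the text (2 dB is an upper bound for the RF path).
for label, loss_db in [("94-GHz RF path", 2.0), ("31-GHz LO path", 2.6)]:
    fraction = power_fraction_delivered(loss_db)
    print(f"{label}: {loss_db} dB -> {100.0 * fraction:.0f}% of power delivered")
```

Under these numbers roughly two thirds of the power survives the RF transition and a little over half survives the longer LO line.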


[Figure B9 panels: PCB board design by Zhipeng (MIT); front side view of the spherical solder bumps
(tin solder bumps assumed spherical, with 100 um height and 150 um pitch); EM simulations by
Daniel Yeh (NCSU); 94-GHz on-board signal trace.]


Fig  B9:  94-GHz  printed  circuit  design  in  HFSS 


Fig B10: Simulated two-port S-parameters for the 94-GHz board-to-chip transition. Port 1 is on the board
and port 2 is on the chip.


B.2.  Hardware  Evaluation 

The  transmitter  was  fabricated  in  IBM  45nm  SOI  CMOS  technology  and  subsequently  packaged  by 
MIT.  NCSU  received  an  evaluation  board  containing  the  packaged  part  and  worked  on  measurement  of 
the  system.  Our  measurements  were  limited  by  the  lack  of  a  custom-designed  heat  sink  for  the  entire 
system.  Limited  resources  prevented  the  custom  design  of  a  new  heat  sink  for  our  measurements.  Without 
this  heat  sink,  a  complete  powering  up  of  the  system  would  result  in  very  high  on-chip  temperatures  and 
overheating.  Therefore,  we  measured  the  input  and  output  return  losses  for  the  RF  ports  and  also  measured 
the  DC  power  consumption  for  the  internal  supplies  all  while  the  final  PAs  were  not  fully  powered  up. 

First, the DC power consumption was measured for the various supplies, with the results summarized in
Table 1. Measurement and simulation agree well for the phase modulator and the mixer. The DC current
of the PAs, however, was much higher than the simulated value (about 2.3 A measured versus 1.382 A
simulated), increasing the chip temperature and exacerbating the overheating condition.

The return losses were then measured at the board level for the 31-GHz LO input and the 94-GHz RF
output. These include the complete board-level transmission line, the board-to-chip transition through
the package, and finally the chip termination. Figure B11 shows the measurement setup for the LO
return loss. Figure B12 compares the measured and simulated S11 for the LO port. As can be seen, the
optimum matching frequency is shifted down, and the return loss at 31 GHz is about 7 dB. We were not
able to identify the source of this disagreement between measurement and simulation. It could plausibly
be due to manufacturing variations in the package and/or the board.
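As a rough guide to what a 7 dB return loss means for the LO drive, the sketch below converts return loss into reflection-coefficient magnitude, reflected power fraction, and mismatch loss using the standard definitions. The function names are hypothetical, and the calculation assumes a simple one-port mismatch with no other losses.

```python
import math


def reflection_magnitude(return_loss_db: float) -> float:
    """|Gamma| for a given return loss: RL = -20 * log10(|Gamma|)."""
    return 10.0 ** (-return_loss_db / 20.0)


def reflected_power_fraction(return_loss_db: float) -> float:
    """Fraction of incident power that is reflected: |Gamma|^2."""
    return 10.0 ** (-return_loss_db / 10.0)


def mismatch_loss_db(return_loss_db: float) -> float:
    """Loss in delivered power caused by the reflection alone."""
    return -10.0 * math.log10(1.0 - reflected_power_fraction(return_loss_db))


rl = 7.0  # measured board-level return loss at 31 GHz, from the text
print(f"|Gamma| = {reflection_magnitude(rl):.2f}")
print(f"reflected power = {100.0 * reflected_power_fraction(rl):.0f}%")
print(f"mismatch loss = {mismatch_loss_db(rl):.2f} dB")
```

About one fifth of the incident LO power is reflected at this level of mismatch, which costs roughly 1 dB of delivered power.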


Table 1: Summary of measured and simulated power consumption of the chip

Block (supply)                | Measurement (4A) | Measurement (4B) | Simulation
------------------------------+------------------+------------------+-----------
At estimated 90°C             |                  |                  |
Digital BB (VDD1V)            | ~0.46A*1V        | ~0.453A*1V       | N/A
PM and Mixer (VDD2V)          | ~0.685A*2V       | ~0.71A*2V        | 0.675A*2V
At estimated 140°C            |                  |                  |
Outphasing 94GHz PA (VDD2V)   | ~2.28A*2V        | ~2.32A*2V        | 1.382A*2V
TX w/o BB total               | ~5.93W           | ~6.06W           | ~4.11W
TX with BB total              | ~6.39W           | ~6.51W           | N/A
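The totals in Table 1 follow directly from the per-supply currents and voltages. The sketch below recomputes them as a consistency check; the dictionary layout and function names are hypothetical, and the values are transcribed from the table.

```python
# (current in A, supply voltage in V) per block, transcribed from Table 1.
# None marks an entry reported as N/A.
TABLE_1 = {
    "4A": {"bb": (0.46, 1.0), "pm_mixer": (0.685, 2.0), "pa": (2.28, 2.0)},
    "4B": {"bb": (0.453, 1.0), "pm_mixer": (0.71, 2.0), "pa": (2.32, 2.0)},
    "sim": {"bb": None, "pm_mixer": (0.675, 2.0), "pa": (1.382, 2.0)},
}


def block_power_w(entry):
    """DC power of one block in watts (0 if the entry is N/A)."""
    return 0.0 if entry is None else entry[0] * entry[1]


def tx_total_w(column: str, include_bb: bool) -> float:
    """Total transmitter DC power, with or without the digital baseband."""
    blocks = TABLE_1[column]
    total = block_power_w(blocks["pm_mixer"]) + block_power_w(blocks["pa"])
    if include_bb:
        total += block_power_w(blocks["bb"])
    return total


for column in ("4A", "4B", "sim"):
    print(column, round(tx_total_w(column, include_bb=False), 2))

# Ratio of measured to simulated PA current for chip 4A.
print(round(TABLE_1["4A"]["pa"][0] / TABLE_1["sim"]["pa"][0], 2))
```

The measured PA current is roughly 1.65 times the simulated value, which accounts for nearly all of the gap between the measured and simulated totals.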
Fig B11: Measurement setup for return loss of the 31-GHz LO port.
Fig B12: Measured and simulated return loss of the 31-GHz LO port, board level.


The return loss of the 94-GHz port was also measured with the 94-GHz outphasing PA powered up.
The measurement setup is shown in Fig. B13 and the measured and simulated results are shown in
Fig. B14. Here, the measurements once again disagree with the simulations.
Fig B14: Measured and simulated return loss of the 94-GHz PA port, board level.

B.3.  New  Architecture  for  94-GHz  Transmitter 

During the final months of the program, we evaluated the potential of alternative 94-GHz architectures.
A block diagram of the new architecture is shown in Fig. B15. The system consists of a direct conversion
to 94 GHz, with quadrature combining at the output. This should reduce I/Q crosstalk in the modulator
and improve EVM. The modulation is created through simple baseband DACs, which are identical for the
I and Q paths. Additionally, no polyphase filter is required in the LO path, simplifying the LO generation
significantly. Instead, a 45-GHz LO signal is required, which is then buffered and frequency-doubled.
Finally, rather than an outphasing combiner either on-chip or on the board, spatial outphasing is
proposed, eliminating pulling issues within the PA. Although this architecture appears promising,
a programmatic decision was made not to pursue a revised 94-GHz design due to limited personnel
and funding. We do, however, remain ready to realize this design in hardware, resources permitting.




Fig  B15:  Proposed  new  architecture  for  the  transmitter 


C.  Nonlinear  Predistortion 


In this section we describe the results from the nonlinear predistortion research done in this project and
published in Y. Li's PhD thesis, MIT 2013.

The linearization approach that we developed was verified through extensive simulations. We first
verify the existence of a working compensator by use of an off-line iterative solving scheme. Table C2
shows the ACPR and EVM improvements through iterations for the example of a LINC system. Figs. C5
and C6 show the EVM and ACPR before and after the off-line compensation. We discovered the need for
a zero-avoidance filter in place of a shaping filter for the LINC system, and a level-avoidance filter for
the AMO system, due to the discontinuities present in the LINC/AMO systems. The zero-avoidance filter
shapes the spectrum of the incoming symbols and keeps the resulting sample magnitude above a certain
threshold level. This type of filter enables the success of off-line iterative compensation.
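The discontinuity that motivates zero avoidance can be illustrated with the textbook LINC signal component separation, in which a sample with amplitude A and phase phi is split into two constant-envelope components at phases phi + theta and phi - theta, with theta = arccos(A / A_max). The sketch below is a minimal illustration under that standard formulation, not the project's implementation; the function names, the test trajectories, and the 0.05 offset are hypothetical.

```python
import numpy as np


def linc_separate(s, a_max):
    """Split complex baseband samples into two constant-envelope components."""
    s = np.asarray(s, dtype=complex)
    amp = np.abs(s)
    if np.any(amp > a_max):
        raise ValueError("sample magnitude exceeds a_max")
    phi = np.angle(s)
    theta = np.arccos(amp / a_max)  # outphasing angle
    s1 = 0.5 * a_max * np.exp(1j * (phi + theta))
    s2 = 0.5 * a_max * np.exp(1j * (phi - theta))
    return s1, s2


def max_phase_step(x):
    """Largest wrapped phase change between consecutive samples, in radians."""
    return float(np.max(np.abs(np.angle(x[1:] * np.conj(x[:-1])))))


t = np.linspace(-0.1, 0.1, 200)
through_zero = t + 0j      # trajectory that passes through the origin
around_zero = t + 0.05j    # trajectory kept away from the origin

for name, traj in [("through zero", through_zero), ("avoiding zero", around_zero)]:
    s1, s2 = linc_separate(traj, a_max=1.0)
    assert np.allclose(s1 + s2, traj)  # the two components sum back to the input
    print(name, round(max_phase_step(s1), 3))
```

When the trajectory passes through the origin, the component phase jumps by about pi between adjacent samples, which demands very wide bandwidth from the phase modulator. Keeping the magnitude above a threshold turns that jump into a smooth rotation.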


Iteration | EVM (%) | ACPR (dB) | Bt    | ^2    | 9
----------+---------+-----------+-------+-------+------
0         | 4.5     | -30.6     | NA    | NA    | NA
1         | 1.7     | -37.6     | 0.145 | 0.174 | 0.289
2         | 1.1     | -42.4     | 0.192 | 0.244 | 0.435
3         | 1.0     | -44.0     | 0.191 | 0.228 | 0.701

Table C2: ACPR and EVM performances of LINC system in off-line iterations, with input sequence generated from a
real-time zero-avoidance filter.


Figure C5: EVM of the LINC system, before (a) and after (b) off-line compensation, with real-time
zero-avoidance input sequence.
Figure C6: Input and output ACPR of the LINC system, with real-time zero-avoidance
input sequence, before and after the off-line compensation.


          | Floating-body PA                   | Body-tied PA
          | Zero-avoidance | No zero-avoidance | Zero-avoidance | No zero-avoidance
----------+----------------+-------------------+----------------+------------------
ACPR (dB) | -30.6 -> -44   | -30.1 -> -39.6    | -41.3 -> -56.4 | -44 -> N/A
EVM (%)   | 4.5 -> 1.0     | 4.2 -> 1.7        | 2.6 -> 0.15    | 0.9 -> N/A

Table C3: ACPR and EVM performance comparisons between using input sequence
with and without zero-avoidance property for LINC systems.

Based  on  a  successful  off-line  compensation,  which  ensures  the  existence  of  a  working  compensator,  we 
further  analyze  the  nonlinear  system  structure  and  arrive  at  a  real-time  compensator.  The  real-time 
compensator’s  position  in  the  overall  baseband  is  shown  in  Fig.  C7.  The  structure  is  the  concatenation  of 
a  nonlinear  system  followed  by  LTI  systems,  whose  characteristics  are  shown  in  Fig.  C8. 
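One common way to realize a nonlinear block followed by an LTI block is a Hammerstein-style cascade: a memoryless odd-order polynomial applied sample by sample, followed by an FIR filter. The sketch below illustrates that general idea only. It is an assumption, not the compensator of Fig. C8, and the function name, coefficients, and taps are hypothetical placeholders.

```python
import numpy as np


def hammerstein_compensate(x, poly_coeffs, fir_taps):
    """Memoryless polynomial nonlinearity followed by an FIR (LTI) filter."""
    x = np.asarray(x, dtype=complex)
    # Nonlinear stage: sum over k of c_k * x * |x|^(2k) (odd-order terms only).
    nonlinear = np.zeros_like(x)
    for k, c in enumerate(poly_coeffs):
        nonlinear = nonlinear + c * x * np.abs(x) ** (2 * k)
    # LTI stage: causal FIR filter, output truncated to the input length.
    return np.convolve(nonlinear, fir_taps)[: len(x)]


# Placeholder parameters, chosen only to exercise the code.
samples = np.array([0.5 + 0.5j, -0.2 + 0.1j, 0.8 - 0.3j])
out = hammerstein_compensate(samples, poly_coeffs=[1.0, 0.1], fir_taps=[1.0, -0.05])
print(np.round(out, 4))
```

In practice the polynomial coefficients and filter taps would be fitted so that the cascade of compensator and transmitter approximates a linear system; that fitting step is outside the scope of this sketch.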
Figure C7: Placement of the compensator in the LINC/AMO systems.


Figure  C8:  Compensator  structure. 


With this model structure, the EVM improves from 4.5% to 2.5% and the ACPR from -30 dB to -39 dB.
The corresponding EVM and ACPR improvements are shown in Figs. C9 and C10.
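For reference, EVM percentages like those quoted here are commonly computed as the RMS error between the received and ideal constellation points, normalized to the RMS magnitude of the ideal points. The sketch below uses that definition as an assumption, since the report does not state which normalization was used (some standards normalize to peak rather than RMS). The function name and test data are hypothetical.

```python
import numpy as np


def evm_percent(measured, reference):
    """RMS error-vector magnitude in percent, normalized to reference RMS power."""
    measured = np.asarray(measured, dtype=complex)
    reference = np.asarray(reference, dtype=complex)
    error = measured - reference
    return 100.0 * np.sqrt(np.mean(np.abs(error) ** 2) / np.mean(np.abs(reference) ** 2))


# Synthetic check: a QPSK constellation with a uniform 4.5% gain error.
qpsk = np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]) / np.sqrt(2)
print(round(float(evm_percent(1.045 * qpsk, qpsk)), 2))
```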


Figure C9: EVM of the LINC system, before (a) and after (b) real-time compensation.
Figure C10: ACPR of the LINC system with real-time compensator.


With a similar compensator structure, we are also able to compensate the AMO system. With 5 pH bump
inductance and power switching levels confined to 1.1 V and 2.2 V, the EVM decreases from 6.7% to 2.7%
and the ACPR improves from -27.4 dB to -36.2 dB. The EVM with and without the compensator is
compared in Fig. C11 and the ACPR result is shown in Fig. C12.


Figure C11: EVM of the AMO system, before (a) and after (b) real-time compensation.
Figure C12: ACPR of the AMO system with a real-time compensator.


In the SYS3 chip, we implemented the baseband to include the AMO SCS followed by the compensator
described above. The block diagram of the compensator is shown in Fig. C13. The layout of the overall
integrated transmitter system chip is shown in Fig. C14. The chip has an overall area of 6 mm x 3 mm
and is fabricated in a 45nm SOI process.
Figure C13: Block diagram of the compensator.
Figure C14: Overview of the integrated transmitter system with digital baseband nonlinear compensation.